
Hi! I wrote a simple program to parse Debian package list files (a simple text record format), sort the list of packages and output the list. I first wrote the program in C (which I speak fluently), but wanting to learn Haskell, have been trying to reimplement it in Haskell. The problem is that my 60-line Haskell program, which doesn't yet do everything the C program does, uses huge amounts of memory, 1.5 gigabytes when parsing a single file. The C implementation can parse and sort more than twice that data in less than 10 megabytes. I can't figure out where the memory goes or how to fix it, but my guess would be it's some lazy computation thunk. (See below after the code for description of the input and output formats.) With a package file of 28000 records (packages), the memory usage is roughly 1.5 gigabytes when compiled with ghc - under hugs it just bails out quickly after stack overflow - and I'd appreciate help in figuring out 1) Where the memory is actually spent? 2) How would one usually go about figuring this out? and 3) How to fix it? [4) I'd also love style hints and other ideas to make my code more idiomatic] I've tried to randomly add some $! operators to the code (that doesn't feel right :-), but so far to no avail. Here's the code: ------------------------------------------------------------ module Main where import List (sort) import Maybe (fromJust, mapMaybe) -- Parse one line of the format "Field: value". -- Ignore those that start with a space. readField :: String -> Maybe (String,String) readField line = if line == "" || head line == ' ' then Nothing else let (name,':':' ':val) = break (== ':') line in Just (name,val) data Package = Package { name, version :: !String } deriving (Show, Eq, Ord) -- Use sprintPackage for formatted output, show for unformatted sprintPackage :: Package -> String sprintPackage (Package name ver) = name ++ replicate (50 - length name) ' ' ++ ver -- Read one record worth of lines (separated by blank lines) getOneRecordLines [] = ([],[]) getOneRecordLines lines = (hd, if tl == [] then [] else tail tl) where (hd,tl) = break (== "") lines -- Read one record worth of lines and parse, returning (Field, value) -- tuples and the rest of the lines (less the just read record) readRecordFields :: [String] -> ([(String,String)], [String]) readRecordFields lines = (mapMaybe readField rl, rest) where (rl,rest) = getOneRecordLines lines -- Convert a list of (Field,value) tuples to a Package recordFieldsToPackage :: [(String,String)] -> Package recordFieldsToPackage fields = let names = [y | (x,y) <- fields, x=="Package"] vers = [y | (x,y) <- fields, x=="Version"] in case (names,vers) of ([], _) -> error "Package has no name." (a:b:rest, _) -> error ("Package has two names ("++a++","++b++").") ([n], []) -> error ("Package "++n++" has no version.") ([n], a:b:rest) -> error ("Package "++n++" has multiple versions.") ([n], [v]) -> Package n v -- Read one record, returning Just Package if it contained a valid -- package and version, Nothing otherwise. readRecord :: [String] -> (Maybe Package,[String]) readRecord lines = if fields == [] then (Nothing, rest) else (Just (recordFieldsToPackage fields), rest) where (fields,rest) = readRecordFields lines -- Converts the list (stream) of lines to a list of packages readRecords :: [String] -> [Package] readRecords [] = [] readRecords lines = let (rec,rest) = readRecord lines in case rec of Just pkg -> pkg : readRecords rest Nothing -> readRecords rest processFile = unlines . (map sprintPackage) . sort . readRecords . lines main :: IO () main = interact processFile ------------------------------------------------------------ The input (stdin) for the program is of the format: ------------------------------------------------------------ Package: somepackagename Version: someversionstring Other-Fields: ... Whatever: ... ... (lines that begin with space are just ignored) Package: pkg2 Version: otherversion ... ------------------------------------------------------------ That is, the file has records separated by blank lines. Only lines that begin with "Package: " or "Version: " are considered in a record, the others are just ignored. When finished, it outputs a formatted list of packages to stdout: ------------------------------------------------------------ a2ps 1:4.14-1 a2ps-perl-ja 1.45-5 a56 1.3-5 a7xpg 0.11.dfsg1-4 a7xpg-data 0.11.dfsg1-4 aa3d 1.0-8 aajm 0.4-3 aap 1.091-1 aap-doc 1.091-1 ... ------------------------------------------------------------ GHC profiling hints that the memory is spent in processFile and main, but no amount of adding $! to those functions seems to help: ------------------------------------------------------------ Thu Mar 18 22:37 2010 Time and Allocation Profiling Report (Final) aptlistsh +RTS -p -RTS total time = 3.18 secs (159 ticks @ 20 ms) total alloc = 1,873,922,000 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc processFile Main 44.0 61.9 main Main 32.1 20.7 readField Main 11.9 9.0 recordFieldsToPackage Main 4.4 2.0 newPkg Main 3.1 0.8 sprintPackage Main 1.3 2.1 readRecordFields Main 1.3 1.3 CAF Main 1.3 0.4 getOneRecordLines Main 0.6 1.4 ------------------------------------------------------------ Sami