
Hi!

I wrote a simple program to parse Debian package list files (a simple text record format), sort the list of packages and output the list. I first wrote the program in C (which I speak fluently), but wanting to learn Haskell, I have been trying to reimplement it in Haskell.

The problem is that my 60-line Haskell program, which doesn't yet do everything the C program does, uses huge amounts of memory: 1.5 gigabytes when parsing a single file. The C implementation can parse and sort more than twice that data in less than 10 megabytes. I can't figure out where the memory goes or how to fix it, but my guess would be that it's some lazy computation thunk.

(See below, after the code, for a description of the input and output formats.)

With a package file of 28,000 records (packages), the memory usage is roughly 1.5 gigabytes when compiled with ghc - under hugs it just bails out quickly with a stack overflow - and I'd appreciate help in figuring out:

1) Where is the memory actually spent?
2) How would one usually go about figuring this out?
3) How can I fix it?

[4) I'd also love style hints and other ideas to make my code more idiomatic.]

I've tried randomly adding some $! operators to the code (that doesn't feel right :-), but so far to no avail.

Here's the code:

------------------------------------------------------------
module Main where

import List (sort)
import Maybe (fromJust, mapMaybe)

-- Parse one line of the format "Field: value".
-- Ignore those that start with a space.
readField :: String -> Maybe (String,String)
readField line =
    if line == "" || head line == ' '
       then Nothing
       else let (name,':':' ':val) = break (== ':') line
            in Just (name,val)

data Package = Package { name, version :: !String }
               deriving (Show, Eq, Ord)

-- Use sprintPackage for formatted output, show for unformatted
sprintPackage :: Package -> String
sprintPackage (Package name ver) =
    name ++ replicate (50 - length name) ' ' ++ ver

-- Read one record worth of lines (separated by blank lines)
getOneRecordLines [] = ([],[])
getOneRecordLines lines = (hd, if tl == [] then [] else tail tl)
    where (hd,tl) = break (== "") lines

-- Read one record worth of lines and parse, returning (Field, value)
-- tuples and the rest of the lines (less the just read record)
readRecordFields :: [String] -> ([(String,String)], [String])
readRecordFields lines = (mapMaybe readField rl, rest)
    where (rl,rest) = getOneRecordLines lines

-- Convert a list of (Field,value) tuples to a Package
recordFieldsToPackage :: [(String,String)] -> Package
recordFieldsToPackage fields =
    let names = [y | (x,y) <- fields, x=="Package"]
        vers  = [y | (x,y) <- fields, x=="Version"]
    in case (names,vers) of
         ([], _)         -> error "Package has no name."
         (a:b:rest, _)   -> error ("Package has two names ("++a++","++b++").")
         ([n], [])       -> error ("Package "++n++" has no version.")
         ([n], a:b:rest) -> error ("Package "++n++" has multiple versions.")
         ([n], [v])      -> Package n v

-- Read one record, returning Just Package if it contained a valid
-- package and version, Nothing otherwise.
readRecord :: [String] -> (Maybe Package,[String])
readRecord lines =
    if fields == []
       then (Nothing, rest)
       else (Just (recordFieldsToPackage fields), rest)
    where (fields,rest) = readRecordFields lines

-- Convert the list (stream) of lines to a list of packages
readRecords :: [String] -> [Package]
readRecords [] = []
readRecords lines =
    let (rec,rest) = readRecord lines
    in case rec of
         Just pkg -> pkg : readRecords rest
         Nothing  -> readRecords rest

processFile = unlines . map sprintPackage . sort . readRecords . lines

main :: IO ()
main = interact processFile
------------------------------------------------------------

The input (stdin) for the program is of the format:

------------------------------------------------------------
Package: somepackagename
Version: someversionstring
Other-Fields: ...
Whatever: ...
 ... (lines that begin with space are just ignored)

Package: pkg2
Version: otherversion
...
------------------------------------------------------------

That is, the file has records separated by blank lines. Only lines that begin with "Package: " or "Version: " are considered in a record; the others are just ignored.

When finished, it outputs a formatted list of packages to stdout:

------------------------------------------------------------
a2ps                                              1:4.14-1
a2ps-perl-ja                                      1.45-5
a56                                               1.3-5
a7xpg                                             0.11.dfsg1-4
a7xpg-data                                        0.11.dfsg1-4
aa3d                                              1.0-8
aajm                                              0.4-3
aap                                               1.091-1
aap-doc                                           1.091-1
...
------------------------------------------------------------

GHC profiling hints that the memory is spent in processFile and main, but no amount of adding $! to those functions seems to help:

------------------------------------------------------------
	Thu Mar 18 22:37 2010 Time and Allocation Profiling Report  (Final)

	   aptlistsh +RTS -p -RTS

	total time  =        3.18 secs   (159 ticks @ 20 ms)
	total alloc = 1,873,922,000 bytes  (excludes profiling overheads)

COST CENTRE            MODULE  %time %alloc

processFile            Main     44.0   61.9
main                   Main     32.1   20.7
readField              Main     11.9    9.0
recordFieldsToPackage  Main      4.4    2.0
newPkg                 Main      3.1    0.8
sprintPackage          Main      1.3    2.1
readRecordFields       Main      1.3    1.3
CAF                    Main      1.3    0.4
getOneRecordLines      Main      0.6    1.4
------------------------------------------------------------

	Sami

Sami,
I'm no profiling expert, but I have a few questions:
- What is the size (in bytes) of your input file?
- Also, does memory usage improve if you remove the "sort"?
Patrick
--
=====================
Patrick LeBoutillier
Rosemère, Québec, Canada

On Sun, Mar 21, 2010 at 08:28:23PM -0400, Patrick LeBoutillier wrote:
> I'm no profiling expert, but I have a few questions:
> - What is the size (in bytes) of your input file?
The input file contains 31,129,639 bytes, with 710,355 lines and 27,954 individual packages/records.

In fact it seems that even if the input stream repeats only the line " a" (that is, a space and the letter a) infinitely, the program eventually eats all memory. That's interesting. So I guess the lines are read in, but then held in memory and lazily not processed further until the entire file has been read, because it's not necessary? Or something.

I tried using $! in readRecordFields, like

    readRecordFields lines = (mapMaybe (readField $!) rl, rest)
        where (rl,rest) = getOneRecordLines lines

but that didn't help either... I guess I don't fully understand $!. Does it force the entire computation below it to finish?

To achieve sane memory behavior, it would seem necessary to parse the lines before they have all been read; then mapMaybe in readRecordFields would throw out the Nothings. After that, I believe it should all run in a constant amount of memory when every line starts with a space.
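(Regarding $! above: a small ghci experiment, if I'm reading it right, suggests that seq, and therefore $!, only evaluates its argument down to the outermost constructor, so the fields of a pair stay behind as unevaluated thunks:

    ghci> let p = (undefined, undefined) :: (Int, Int)
    ghci> p `seq` "fine"      -- the (,) constructor alone satisfies seq
    "fine"
    ghci> fst p `seq` "boom"  -- forcing a component hits the thunk
    *** Exception: Prelude.undefined

So sprinkling $! around presumably can't force the parsing all the way down.)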
> - Also, does memory usage improve if you remove the "sort"?
Yes. Then it only takes a few megabytes, regardless of how large the file is.

Thanks,

	Sami

On Mon, Mar 22, 2010 at 03:06:08AM +0200, Sami Liedes wrote:
> On Sun, Mar 21, 2010 at 08:28:23PM -0400, Patrick LeBoutillier wrote:
>> - Also, does memory usage improve if you remove the "sort"?
> Yes. Then it only takes a few megabytes, regardless of how large the file is.
This led me to investigate further. The following hack (and nothing less crude that I could invent) seems to significantly lessen the amount of memory needed by the program: now it takes maybe 180 megabytes instead of 1.5 gigabytes. I guess the 180 megabytes consists of the output stream (string), which is now explicitly computed in its entirety before sorting and outputting it.

It's still far from perfect. I think there definitely _should_ be a better way of forcing some computation to be strict so that it takes less memory. Any ideas?

And I especially despise the (sum (map (const 0) b)) thing, but it seems nothing less ugly does the trick. It seems seq doesn't force the entire computation to finish, only the next level. Is there some way to force the entire computation subtree to finish? That way I could say just seq' b (if seq' were the function that does that).

(See my original posting for the entire code.)

------------------------------------------------------------
-- Return the unsorted package list and the sorted package list
processFile :: String -> (String,String)
processFile str =
    let l    = lines str
        pkgs = readRecords l
    in (unlines (map sprintPackage pkgs),
        unlines (map sprintPackage (sort pkgs)))

-- Force the unsorted pkg list to be computed first (outputting it also
-- does the trick, but a mere (map id b) doesn't). This is really ugly.
-- A plain "do putStrLn c" however eats 1.5 GB of memory instead of 180 MB.
main :: IO ()
main = do
    a <- getContents
    let (b,c) = processFile a
     in seq (sum (map (const 0) b))
            (do putStrLn c)
------------------------------------------------------------

	Sami
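P.S. A slightly less ugly spelling of the same trick should be length, since it also just walks every cons cell of b without looking at the characters. A sketch of what I mean (same forcing behavior, as far as I can tell):

    main :: IO ()
    main = do
        a <- getContents
        let (b,c) = processFile a
        -- length forces the spine of b (every cons cell, not the
        -- elements) before c is printed, like sum/map/const did.
        length b `seq` putStrLn c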

On Mon, 22 Mar 2010 20:58:29 +0100, Sami Liedes wrote:
> It seems seq doesn't force the entire computation to finish, only the next level. Is there some way to force the entire computation subtree to finish? That way I could say just seq' b (if seq' were the function that does that).
You can use Control.Parallel.Strategies.rnf (Reduce to Normal Form) for that.

Kind regards,
Henk-Jan van Tuyl

--
http://Van.Tuyl.eu/
http://members.chello.nl/hjgtuyl/tourdemonad.html
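P.S. A quick sketch (untested) of such a seq', assuming the Strategies API where rnf has type NFData a => a -> ():

    import Control.Parallel.Strategies

    -- rnf reduces its argument all the way to normal form, so this
    -- forces the whole structure, not just the outermost constructor.
    seq' :: NFData a => a -> b -> b
    seq' a b = rnf a `seq` b

Note that for your Package type you would still need to write an NFData instance.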

Sami Liedes wrote:
> Patrick LeBoutillier wrote:
>> I'm no profiling expert, but I have a few questions:
>> - What is the size (in bytes) of your input file?
> The input file contains 31,129,639 bytes, with 710,355 lines and 27,954 individual packages/records.
> In fact it seems that even if the input stream repeats only the line " a" (that is, a space and the letter a) infinitely, the program eventually eats all memory. That's interesting.
> So I guess the lines are read in, but then held in memory and lazily not processed further until the entire file has been read, because it's not necessary? Or something.
Yep, pretty much. Basically, readRecords returns a list of unevaluated expressions of type Package. But because they haven't been fully evaluated yet, they still contain a lot of references to the input stream. This wouldn't be a problem if the list itself were unevaluated, but it appears that sort forces the spine of the list much earlier than its elements, which means that you've read the whole file into memory without also parsing each package into a Package record. This way, you retain way too much of the input file.

A quick fix would be to intersperse a function between sort and readRecords that imposes proper evaluation order:

    import Control.Parallel.Strategies -- or Control.DeepSeq

    instance NFData Package where
        rnf (Package n v) = rnf n `seq` rnf v `seq` ()

    processFile = unlines . map sprintPackage . sort
                . strictify . readRecords . lines
        where
        strictify []     = []
        strictify (x:xs) = rnf x `seq` (x : strictify xs)

Here strictify makes sure that each cons cell comes with a fully evaluated element x while the list itself is still lazy. (Since sort has to evaluate the list anyway, the latter doesn't really matter and you could just as well use strictify xs = rnf xs `seq` xs.)

While I expect the above to work, I hesitate to claim that it actually does. The reason is that I don't understand your readRecords function at a glance; unlike the processFile pipeline, it is not apparent how it works. Since finding and fixing space leaks is easier if your code is more obvious, I recommend reformulating it as a function composition as well, for instance something like this:

    import Control.Arrow ((&&&))
    import Data.List (groupBy, stripPrefix)
    import Data.Maybe (catMaybes)

    readRecords = map readRecord . splitByEmptyLines

    splitByEmptyLines = filter (not . null) . groupBy (\x y -> y /= "")

    readRecord = uncurry Package
               . (prefix "Package: " &&& prefix "Version: ")
               . filter (not . beginsWithSpace)
        where
        beginsWithSpace (' ':xs) = True
        beginsWithSpace _        = False

        -- be warned, head may raise an exception
        prefix s = head . catMaybes . map (stripPrefix s)

The rule is that explicit recursion should be replaced by high-level recursion schemes like map, filter, fold, concat etc. whenever applicable. This way, it's much easier to see how much of the input is retained for each Package. For instance, your original readRecord function could read an arbitrary number of lines, while here it's clear that readRecord can't cross the "boundary" imposed by splitByEmptyLines.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
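P.S. If you'd rather use Control.DeepSeq, the same strictify can be written with the deepseq package (a sketch, assuming you have deepseq installed; its deepseq combinator is defined as deepseq a b = rnf a `seq` b):

    import Control.DeepSeq (NFData(..), deepseq)

    instance NFData Package where
        rnf (Package n v) = rnf n `seq` rnf v `seq` ()

    -- Force each element to normal form as its cons cell is produced,
    -- while leaving the spine of the list lazy.
    strictify :: NFData a => [a] -> [a]
    strictify = foldr (\x xs -> x `deepseq` (x : xs)) []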

On Wed, Mar 24, 2010 at 11:23:05AM +0100, Heinrich Apfelmus wrote:
> A quick fix would be to intersperse a function between sort and readRecords that imposes proper evaluation order: [...]
> While I expect the above to work, I hesitate to claim that it actually does. The reason is that I don't understand your readRecords function at a glance; unlike the processFile pipeline, it is not apparent how it works. Since finding and fixing space leaks is easier if your code is more obvious, I recommend reformulating it as a function composition as well, for instance something like this: [...]
Thanks, that was enlightening! With your version of readRecords, the program takes as much memory as with mine (it's still more than 10x what the C version takes, but perhaps that's the cost of using Haskell). But your version taught me a lot. Exactly the kind of feedback I needed :)

	Sami

Sami Liedes wrote:
> On Wed, Mar 24, 2010 at 11:23:05AM +0100, Heinrich Apfelmus wrote:
>> A quick fix would be to intersperse a function between sort and readRecords that imposes proper evaluation order: [...] While I expect the above to work, I hesitate to claim that it actually does. The reason is that I don't understand your readRecords function at a glance; unlike the processFile pipeline, it is not apparent how it works. Since finding and fixing space leaks is easier if your code is more obvious, I recommend reformulating it as a function composition as well, for instance something like this: [...]
> Thanks, that was enlightening! With your version of readRecords, the program takes as much memory as with mine (it's still more than 10x what the C version takes, but perhaps that's the cost of using Haskell). But your version taught me a lot. Exactly the kind of feedback I needed :)
My pleasure. :)

I should point out that my version of readRecords still needs to be composed with the strictify function, for the same reasons as yours. But it shouldn't be very difficult to bake the evaluation order into readRecords now, if desired.

Also note that my readRecord has introduced another potential space leak. But thanks to the clarified structure, it is easy to spot. Namely, the function

    prefix "Package: " &&& prefix "Version: "

is problematic. Expanding the (&&&) combinator, this is equal to

    \xs -> (prefix "Package: " xs, prefix "Version: " xs)

which means that xs is used twice and is thus potentially leaky. After all, the first component might evaluate xs in full, which will then leak around until the second component consumes it. This is exactly what will happen if you feed it a pathological input file consisting of just a single but huge record, along the lines of

    Version: xyz
    Package: haskell
    Foo: bar
    Foo: bar
    ...
    Foo: bar

If this turns out to be a major problem, you can simply rewrite the offending function to evaluate the components "in parallel". The partition function in the Haskell Prelude demonstrates how to do this.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
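P.S. A sketch of what the partition-style rewrite could look like here (untested, and nameAndVersion is just a name I made up):

    import Data.List (stripPrefix)

    -- Extract both fields in a single pass over the record's lines, in
    -- the style of the Prelude's partition: the lazy pattern ~(n, v)
    -- lets foldr produce the pair immediately, and both components
    -- share one traversal instead of each keeping its own reference
    -- to the whole list.
    nameAndVersion :: [String] -> (String, String)
    nameAndVersion = foldr select (missing, missing)
      where
        missing = error "field missing"
        select line ~(n, v)
          | Just n' <- stripPrefix "Package: " line = (n', v)
          | Just v' <- stripPrefix "Version: " line = (n, v')
          | otherwise                               = (n, v)

    readRecord = uncurry Package . nameAndVersion
               . filter (not . beginsWithSpace)
      where
        beginsWithSpace (' ':_) = True
        beginsWithSpace _       = False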