Program using 500MB RAM to process 5MB file

I'm relatively new to Haskell, so, as one does, I am rewriting an
existing program in Haskell to help learn the language.
However, it eats up all my RAM whenever I run the program.
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=3175#a3175
Obviously I'm doing something wrong, but without my magical FP pants I
don't know what that might be.
--
Lucas Hazel

You could profile your app for memory usage. Then you could figure out just
which function is blowing up the memory usage and work out how to optimize it.
http://book.realworldhaskell.org/read/profiling-and-optimization.html
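The usual incantation is along these lines (flag spellings vary by GHC version: -auto-all is the older name for what newer GHCs call -fprof-auto, and newer GHCs also want -rtsopts at compile time):

ghc --make -prof -auto-all Main.hs
./Main +RTS -p     # time/allocation profile, written to Main.prof
./Main +RTS -hc    # heap profile, written to Main.hp
hp2ps -c Main.hp   # render the heap profile to PostScript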
2009/4/2
I'm relatively new to Haskell, so, as one does, I am rewriting an existing program in Haskell to help learn the language.
However, it eats up all my RAM whenever I run the program.
[SNIP]
-- We can't solve problems by using the same kind of thinking we used when we created them. - A. Einstein

On Thu, Apr 02, 2009 at 07:55:07PM -0400, Rick R wrote:
You could profile your app for memory usage. Then you could figure out just which function is blowing up the memory usage and work out how to optimize it.
http://book.realworldhaskell.org/read/profiling-and-optimization.html
[SNIP]
I ran some profiling, as suggested:
[SNIP]
total time  =  8.36 secs  (418 ticks @ 20 ms)
total alloc = 3,882,593,720 bytes  (excludes profiling overheads)

COST CENTRE  MODULE  %time  %alloc

line         PkgDb    89.7    93.5

                                          individual     inherited
COST CENTRE  MODULE  no.  entries   %time  %alloc   %time  %alloc
line         PkgDb   305  109771     89.7    93.3    89.7    93.3
[SNIP]
The 'line' function is part of the file parser:

line :: Parser String
line = anyChar `manyTill` newline    -- one line: any characters up to a newline

files' :: Parser Files
files' = line `manyTill` newline     -- a file list: lines up to an empty line
Perhaps I should also explain the structure of the file. It's for a
simple package manager called pkgutils, used for CRUX[1]. The file
contains information about all the installed packages and is structured
as follows:
<package name>
<package version>
<file>
<file>
...
<file>
<package name>
...
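(In Haskell terms, one entry amounts to something like the record below; the field names are illustrative rather than taken from the paste, though Files is presumably the [String] that files' produces.)

data Package = Package
  { pkgName    :: String   -- <package name>
  , pkgVersion :: String   -- <package version>
  , pkgFiles   :: [String] -- the <file> lines
  }

type Files = [String]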
The profiling shows that the memory is simply consumed by reading in
all the lines; the graph from using -p -hd shows roughly O(n log n)
growth of the heap as the collection of lines grows.
Is there a better way to do this?
[1] http://crux.nu
--
Lucas Hazel

2009/4/2
[SNIP]
Perhaps I should also explain the structure of the file. It's for a simple package manager called pkgutils, used for CRUX[1].
[SNIP]
Is there a better way to do this?
In this case the syntax of your file seems pretty simple. Someone else suggested a streaming approach. Putting those ideas together, I defined testInput as follows:

testInput = "<package name>\n"
         ++ "<package version>\n"
         ++ "<file>\n"
         ++ "<file>\n"
         ++ "...\n"
         ++ "<file>\n"
         ++ "\n"
         ++ "<package name>\n"
         ++ "...\n"

Here is an interactive experiment I tried:

GHCi> :t lines
lines :: String -> [String]
GHCi> lines testInput
["<package name>","<package version>","<file>","<file>","...","<file>","","<package name>","..."]

Okay, it looks like we can use 'lines' from the Prelude to split the input into lines.

GHCi> :t cycle
cycle :: [a] -> [a]
GHCi> :t cycle testInput
cycle testInput :: [Char]

Using cycle on testInput like this gives us an infinite input that we can use to check whether we have a streaming approach.

GHCi> take 10 $ lines $ cycle testInput
["<package name>","<package version>","<file>","<file>","...","<file>","","<package name>","...","<package name>"]

Success. So if you like, you could use something like hGetContents, which reads a file lazily, run lines over the file, and get a lazy stream of the file's lines. Then you can use something like takeWhile to take lines until you hit an empty line, and build your parsing functions up from there. You could experiment with bytestrings, but in this particular application I wouldn't expect a huge benefit. Here I think you can run in space proportional to the longest list of files you encounter in the input.

Hope that helps,
Jason
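A minimal sketch of what Jason describes (the function name and the "pkgdb" path are made up for illustration; this is not the thread's actual code):

-- Each entry is a name line, a version line, then file lines up to
-- the next empty line. Lazy I/O plus lazy list processing means the
-- whole file is never resident at once.
parsePackages :: [String] -> [(String, String, [String])]
parsePackages (name : version : rest) =
  let (fs, rest') = break null rest            -- files up to the blank line
  in (name, version, fs) : parsePackages (drop 1 rest')
parsePackages _ = []

main :: IO ()
main = do
  contents <- readFile "pkgdb"                 -- readFile is lazy
  mapM_ print (parsePackages (lines contents))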

2009/4/2
I'm relatively new to Haskell, so, as one does, I am rewriting an existing program in Haskell to help learn the language.
However, it eats up all my RAM whenever I run the program.
[SNIP]
(1) You are using plain Strings. Those things are like 8 bytes per character (or something; someone more knowledgeable can give a more accurate figure). Use bytestrings (with bytestring-utf8 if you need it) instead.

(2) You are parsing strictly, meaning you have to read the whole input file before anything can be output. This may be necessary for your application, but Haskell is *very* strong with streaming applications. Change to a lazy parser and you will run in constant memory.

(I don't know offhand of any lazy parsing libraries, but I've heard them discussed before, so they're somewhere.)

Luke
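For point (1), a lazy-ByteString version of the line splitting could look like this (a sketch only; "pkgdb" is a made-up path):

import qualified Data.ByteString.Lazy.Char8 as L

-- Read the file lazily in chunks and split it into lines; ByteString
-- avoids the per-character cons-cell overhead of String.
main :: IO ()
main = do
  contents <- L.readFile "pkgdb"
  print (length (L.lines contents))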

I also suspect that manyTill is a really bad choice, since it doesn't give you anything until the end token. It would be much better if you could rewrite your parser in terms of many and many1.

--Sterl

On Apr 2, 2009, at 10:08 PM, Luke Palmer wrote:
[SNIP]

(2) You are parsing strictly, meaning you have to read the whole input file before anything can be output.
This is likely the main performance problem. I'm guessing you are using Parsec. Try switching to polyparse if you want to try out lazy parser combinators instead (module Text.ParserCombinators.Poly.Lazy).

http://www.cs.york.ac.uk/fp/polyparse/

There are other incremental parsers out there too, although they may be more complicated because they solve a larger problem:

http://yi-editor.blogspot.com/2008/11/incremental-parsing-in-yi.html
I also suspect that manyTill is a really bad choice, since it doesn't give you anything until the end token.
This is not an "also"; it is the same problem of too much strictness. Polyparse's 'manyFinally' is the corresponding combinator that works lazily if you want it to.

Regards,
Malcolm
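To make the strictness point concrete, here is a toy list-based analogue of the two behaviours (my own names; this is neither Parsec nor polyparse code):

-- Strict flavour: nothing is produced until the end token is found.
strictTill :: Eq a => a -> [a] -> Maybe [a]
strictTill end xs = case break (== end) xs of
  (pre, _ : _) -> Just pre    -- succeeds only after seeing the end token
  _            -> Nothing     -- the end token never appeared

-- Lazy flavour: elements stream out as the consumer demands them.
lazyTill :: Eq a => a -> [a] -> [a]
lazyTill end = takeWhile (/= end)

main :: IO ()
main =
  -- The lazy flavour works even on unbounded input:
  print (take 3 (lazyTill '\n' ("abc" ++ repeat 'x')))
  -- strictTill '\n' (repeat 'x') would never return: no end token, no output.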

lucas@die.net.au writes:
I'm relatively new to Haskell, so, as one does, I am rewriting an existing program in Haskell to help learn the language.
However, it eats up all my RAM whenever I run the program.
This typically happens to me when I parse large files and am either

a) using a parser that is too strict (like Luke says), or

b) using a parser that is too lazy - or rather, it parses into a lazy data structure which is then populated with unevaluated thunks holding onto the input data.

Also, you probably want memory profiling (+RTS -h and friends), not time profiling (+RTS -p).

-k
--
If I haven't seen further, it is by standing in the footprints of giants
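One way to rule out case (b) is to deep-force the parsed structure as soon as it is built, so no thunks keep the raw input alive. A generic sketch (the "pkgdb" path and the words-based "parser" are stand-ins, not the original code):

import Control.DeepSeq (force)
import Control.Exception (evaluate)

main :: IO ()
main = do
  contents <- readFile "pkgdb"
  let parsed = map words (lines contents)  -- stand-in for the real parser
  parsed' <- evaluate (force parsed)       -- fully evaluate: no live thunks
  print (length parsed')

Note the trade-off: this makes the whole parsed result resident, which is the opposite of streaming; forcing each entry as it is produced is the middle ground.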

On Fri, Apr 03, 2009 at 10:22:07AM +0200, Ketil Malde wrote:
[SNIP]
Thanks for all the help, everyone. I've decided to dump Parsec, as the
file structure is simple enough to handle with basic list manipulation
(which is, I've read, one of Haskell's strong points), and the result
has turned out to be much simpler code.
I think I was reading the Write Yourself a Scheme tutorial when I
started writing that code, so naturally I started with Parsec.
As a side note, I was reading the I/O section of RWH last night and
came across the lazy vs. strict I/O part; however, it didn't occur to
me that Parsec was strict.
Anyway, thanks for all the help and suggestions.
--
Lucas Hazel

On Fri, Apr 03, 2009 at 10:27:08PM +1100, lucas@die.net.au wrote:
[SNIP]
I've decided to dump Parsec, as the file structure is simple enough to handle with basic list manipulation.
[SNIP]
Using lazy I/O has reduced run time by 75% and RAM consumption to 3MB.
Thank you :)
--
Lucas Hazel
participants (7)
- Jason Dagit
- Ketil Malde
- lucas@die.net.au
- Luke Palmer
- Malcolm Wallace
- Rick R
- Sterling Clover