
2009/4/2
On Thu, Apr 02, 2009 at 07:55:07PM -0400, Rick R wrote:
You could profile your app for memory usage. Then you could see just what function is blowing up the memory usage and work out how to optimize it.
http://book.realworldhaskell.org/read/profiling-and-optimization.html
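As a rough sketch of what that looks like in practice (the program and module names here are placeholders, and the flag spellings are those of GHC 6.x; check your GHC's user guide):

```shell
ghc -prof -auto-all --make Main.hs -o pkgtool   # build with cost centres on all top-level bindings
./pkgtool +RTS -p -hd -RTS                      # time profile (-p) plus heap profile by closure (-hd)
hp2ps -c pkgtool.hp                             # render the heap profile to PostScript
```

The `-p` run writes a `pkgtool.prof` file like the one quoted below.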
2009/4/2
I'm relatively new to Haskell, so, as one does, I am rewriting an existing program in Haskell to help learn the language.
However, it eats up all my RAM whenever I run the program.
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=3175#a3175
Obviously I'm doing something wrong, but without my magical FP pants I don't know what that might be.
I ran some profiling as suggested,
[SNIP]
total time  =   8.36 secs  (418 ticks @ 20 ms)
total alloc = 3,882,593,720 bytes  (excludes profiling overheads)

COST CENTRE  MODULE  %time %alloc
line         PkgDb    89.7   93.5

COST CENTRE  MODULE  no.  entries  %time %alloc  %time %alloc
line         PkgDb   305  109771    89.7   93.3   89.7   93.3
[SNIP]
The line function is part of the file parser
line :: Parser String
line = anyChar `manyTill` newline

files' :: Parser Files
files' = line `manyTill` newline
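For reference, a self-contained version of these two parsers that can be loaded in GHCi. The import path and the `Files` type are assumptions (the original only shows the two parsers); here `Files` is taken to be `[String]` and classic Parsec is used:

```haskell
-- Hypothetical self-contained version; 'Files' is an assumed type alias.
import Text.ParserCombinators.Parsec

type Files = [String]

-- One line: any characters up to and including the newline.
line :: Parser String
line = anyChar `manyTill` newline

-- One package record: lines up to and including a blank line.
files' :: Parser Files
files' = line `manyTill` newline

main :: IO ()
main = print (parse files' "" "pkg\n1.0\n/bin/pkg\n\n")
-- prints: Right ["pkg","1.0","/bin/pkg"]
```

Note that `anyChar `manyTill` newline` must try `newline` before every character, and the whole result is accumulated before anything is returned, which is consistent with `line` dominating the allocation in the profile.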
Perhaps I should also explain the structure of the file. It's for a simple package manager called pkgutils, used for CRUX[1]. The file contains information for all the packages installed and is structured as follows
<package name> <package version> <file> <file> ... <file>
<package name> ...
From the profiling it shows that the memory is simply consumed by reading in all the lines; the graph from using -p -hd shows an almost O(n log n) growth of the heap as the collection of lines grows.
Is there a better way to do this?
In this case the syntax of your file seems pretty simple. Someone else suggested a streaming approach. Putting those ideas together, I defined testInput as follows:

testInput = "<package name>\n"
         ++ "<package version>\n"
         ++ "<file>\n"
         ++ "<file>\n"
         ++ "...\n"
         ++ "<file>\n"
         ++ "\n"
         ++ "<package name>\n"
         ++ "...\n"

Here is an interactive experiment I tried:

GHCi> :t lines
lines :: String -> [String]
GHCi> lines testInput
["<package name>","<package version>","<file>","<file>","...","<file>","","<package name>","..."]

Okay, looks like we can use 'lines' from the Prelude to split the input into lines.

GHCi> :t cycle
cycle :: [a] -> [a]
GHCi> :t cycle testInput
cycle testInput :: [Char]

Using cycle on the testInput like this will give us an infinite input that we can use to see if we have a streaming approach.

GHCi> take 10 $ lines $ cycle testInput
["<package name>","<package version>","<file>","<file>","...","<file>","","<package name>","...","<package name>"]

Success. So if you like, you could use something like hGetContents, which reads a file lazily, run lines over the file, and get a lazy stream of the file's lines. Then you can use something like takeWhile to take lines until you hit an empty line, and build up your parsing functions from there.

You could experiment with bytestrings, but in this particular application I wouldn't expect to see a huge benefit. Here I think you can run in space proportional to the longest list of files that you encounter in the input.

Hope that helps,
Jason
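To make the streaming idea concrete, here is one hedged way to split the lazy stream of lines into blank-line-separated groups using only the Prelude. The `groups` function and the sample input are illustrative inventions, not pkgutils code:

```haskell
-- Split a lazy stream of lines into blank-line-separated groups.
-- Each group is one package: name, version, then its files.
groups :: [String] -> [[String]]
groups ls = case break null ls of
  (g, [])     -> [g | not (null g)]
  (g, _:rest) -> [g | not (null g)] ++ groups rest

main :: IO ()
main = do
  -- With readFile (which uses hGetContents), this whole pipeline is lazy.
  let testInput = "pkg-a\n1.0\n/bin/a\n/lib/a\n\npkg-b\n2.0\n/bin/b\n"
  mapM_ print (groups (lines testInput))
```

Because lines, break, and groups are all lazy, consuming one group at a time keeps residency proportional to the longest single package entry, which matches the space claim above.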