Program using 500MB RAM to process 5MB file

I'm relatively new to Haskell, so, as one does, I am rewriting an
existing program in Haskell to help learn the language.
However, it eats up all my RAM whenever I run the program.
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=3175#a3175
Obviously I'm doing something wrong, but without my magical FP pants I
don't know what that might be.
--
Lucas Hazel

You could profile your app for memory usage. Then you could figure out just
which function is blowing up the memory usage and work out how to optimize it.
http://book.realworldhaskell.org/read/profiling-and-optimization.html
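The usual incantation is along these lines (flag spellings vary by GHC version: -auto-all is the older name for what newer GHCs call -fprof-auto, and newer GHCs also want -rtsopts at compile time):

ghc --make -prof -auto-all Main.hs
./Main +RTS -p     # time/allocation profile, written to Main.prof
./Main +RTS -hc    # heap profile, written to Main.hp
hp2ps -c Main.hp   # render the heap profile to PostScript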
2009/4/2
I'm relatively new to Haskell, so, as one does, I am rewriting an existing program in Haskell to help learn the language.
However, it eats up all my RAM whenever I run the program.
[SNIP]
-- We can't solve problems by using the same kind of thinking we used when we created them. - A. Einstein

On Thu, Apr 02, 2009 at 07:55:07PM -0400, Rick R wrote:
You could profile your app for memory usage. Then you could figure out just which function is blowing up the memory usage and work out how to optimize it.
http://book.realworldhaskell.org/read/profiling-and-optimization.html
[SNIP]
I ran some profiling, as suggested:
[SNIP]
total time  =  8.36 secs  (418 ticks @ 20 ms)
total alloc = 3,882,593,720 bytes  (excludes profiling overheads)

COST CENTRE  MODULE  %time  %alloc

line         PkgDb    89.7    93.5

                                          individual     inherited
COST CENTRE  MODULE  no.  entries   %time  %alloc   %time  %alloc
line         PkgDb   305  109771     89.7    93.3    89.7    93.3
[SNIP]
The 'line' function is part of the file parser:

line :: Parser String
line = anyChar `manyTill` newline    -- one line: any characters up to a newline

files' :: Parser Files
files' = line `manyTill` newline     -- a file list: lines up to an empty line
Perhaps I should also explain the structure of the file. It's for a
simple package manager called pkgutils, used for CRUX[1]. The file
contains information about all the installed packages and is structured
as follows:
<package name>
<package version>
<file>
<file>
...
<file>
<package name>
...
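(In Haskell terms, one entry amounts to something like the record below; the field names are illustrative rather than taken from the paste, though Files is presumably the [String] that files' produces.)

data Package = Package
  { pkgName    :: String   -- <package name>
  , pkgVersion :: String   -- <package version>
  , pkgFiles   :: [String] -- the <file> lines
  }

type Files = [String]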
The profiling shows that the memory is simply consumed by reading in
all the lines; the graph from using -p -hd shows roughly O(n log n)
growth of the heap as the collection of lines grows.
Is there a better way to do this?
[1] http://crux.nu
--
Lucas Hazel

2009/4/2
[SNIP]
Perhaps I should also explain the structure of the file. It's for a simple package manager called pkgutils, used for CRUX[1].
[SNIP]
Is there a better way to do this?
In this case the syntax of your file seems pretty simple. Someone else suggested a streaming approach. Putting those ideas together, I defined testInput as follows:

testInput = "<package name>\n"
         ++ "<package version>\n"
         ++ "<file>\n"
         ++ "<file>\n"
         ++ "...\n"
         ++ "<file>\n"
         ++ "\n"
         ++ "<package name>\n"
         ++ "...\n"

Here is an interactive experiment I tried:

GHCi> :t lines
lines :: String -> [String]
GHCi> lines testInput
["<package name>","<package version>","<file>","<file>","...","<file>","","<package name>","..."]

Okay, it looks like we can use 'lines' from the Prelude to split the input into lines.

GHCi> :t cycle
cycle :: [a] -> [a]
GHCi> :t cycle testInput
cycle testInput :: [Char]

Using cycle on testInput like this gives us an infinite input that we can use to check whether we have a streaming approach.

GHCi> take 10 $ lines $ cycle testInput
["<package name>","<package version>","<file>","<file>","...","<file>","","<package name>","...","<package name>"]

Success. So if you like, you could use something like hGetContents, which reads a file lazily, run lines over the file, and get a lazy stream of the file's lines. Then you can use something like takeWhile to take lines until you hit an empty line, and build your parsing functions up from there. You could experiment with bytestrings, but in this particular application I wouldn't expect a huge benefit. Here I think you can run in space proportional to the longest list of files you encounter in the input.

Hope that helps,
Jason
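A minimal sketch of what Jason describes (the function name and the "pkgdb" path are made up for illustration; this is not the thread's actual code):

-- Each entry is a name line, a version line, then file lines up to
-- the next empty line. Lazy I/O plus lazy list processing means the
-- whole file is never resident at once.
parsePackages :: [String] -> [(String, String, [String])]
parsePackages (name : version : rest) =
  let (fs, rest') = break null rest            -- files up to the blank line
  in (name, version, fs) : parsePackages (drop 1 rest')
parsePackages _ = []

main :: IO ()
main = do
  contents <- readFile "pkgdb"                 -- readFile is lazy
  mapM_ print (parsePackages (lines contents))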

2009/4/2
I'm relatively new to Haskell, so, as one does, I am rewriting an existing program in Haskell to help learn the language.
However, it eats up all my RAM whenever I run the program.
[SNIP]
(1) You are using plain Strings. Those things are like 8 bytes per character (or something; someone more knowledgeable can give a more accurate figure). Use bytestrings (with bytestring-utf8 if you need it) instead.

(2) You are parsing strictly, meaning you have to read the whole input file before anything can be output. This may be necessary for your application, but Haskell is *very* strong with streaming applications. Change to a lazy parser and you will run in constant memory.

(I don't know offhand of any lazy parsing libraries, but I've heard them discussed before, so they're somewhere.)

Luke
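For point (1), a lazy-ByteString version of the line splitting could look like this (a sketch only; "pkgdb" is a made-up path):

import qualified Data.ByteString.Lazy.Char8 as L

-- Read the file lazily in chunks and split it into lines; ByteString
-- avoids the per-character cons-cell overhead of String.
main :: IO ()
main = do
  contents <- L.readFile "pkgdb"
  print (length (L.lines contents))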

I also suspect that manyTill is a really bad choice, since it doesn't give you anything until the end token. It would be much better if you could rewrite your parser in terms of many and many1.

--Sterl

On Apr 2, 2009, at 10:08 PM, Luke Palmer wrote:
[SNIP]

(2) You are parsing strictly, meaning you have to read the whole input file before anything can be output.
This is likely the main performance problem. I'm guessing you are using Parsec. Try switching to polyparse if you want to try out lazy parser combinators instead (module Text.ParserCombinators.Poly.Lazy).

http://www.cs.york.ac.uk/fp/polyparse/

There are other incremental parsers out there too, although they may be more complicated because they solve a larger problem:

http://yi-editor.blogspot.com/2008/11/incremental-parsing-in-yi.html
I also suspect that manyTill is a really bad choice, since it doesn't give you anything until the end token.
This is not an "also"; it is the same problem of too much strictness. Polyparse's 'manyFinally' is the corresponding combinator that works lazily if you want it to.

Regards,
Malcolm
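To make the strictness point concrete, here is a toy list-based analogue of the two behaviours (my own names; this is neither Parsec nor polyparse code):

-- Strict flavour: nothing is produced until the end token is found.
strictTill :: Eq a => a -> [a] -> Maybe [a]
strictTill end xs = case break (== end) xs of
  (pre, _ : _) -> Just pre    -- succeeds only after seeing the end token
  _            -> Nothing     -- the end token never appeared

-- Lazy flavour: elements stream out as the consumer demands them.
lazyTill :: Eq a => a -> [a] -> [a]
lazyTill end = takeWhile (/= end)

main :: IO ()
main =
  -- The lazy flavour works even on unbounded input:
  print (take 3 (lazyTill '\n' ("abc" ++ repeat 'x')))
  -- strictTill '\n' (repeat 'x') would never return: no end token, no output.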

lucas@die.net.au writes:
I'm relatively new to Haskell, so, as one does, I am rewriting an existing program in Haskell to help learn the language.
However, it eats up all my RAM whenever I run the program.
This typically happens to me when I parse large files and am either

a) using a parser that is too strict (like Luke says), or

b) using a parser that is too lazy - or rather, it parses into a lazy data structure which is then populated with unevaluated thunks holding onto the input data.

Also, you probably want memory profiling (+RTS -h and friends), not time profiling (+RTS -p).

-k
--
If I haven't seen further, it is by standing in the footprints of giants
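One way to rule out case (b) is to deep-force the parsed structure as soon as it is built, so no thunks keep the raw input alive. A generic sketch (the "pkgdb" path and the words-based "parser" are stand-ins, not the original code):

import Control.DeepSeq (force)
import Control.Exception (evaluate)

main :: IO ()
main = do
  contents <- readFile "pkgdb"
  let parsed = map words (lines contents)  -- stand-in for the real parser
  parsed' <- evaluate (force parsed)       -- fully evaluate: no live thunks
  print (length parsed')

Note the trade-off: this makes the whole parsed result resident, which is the opposite of streaming; forcing each entry as it is produced is the middle ground.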

On Fri, Apr 03, 2009 at 10:22:07AM +0200, Ketil Malde wrote:
[SNIP]
Thanks for all the help, everyone. I've decided to dump Parsec, as the
file structure is simple enough to handle with basic list manipulation
(which is, I've read, one of Haskell's strong points), and the result
has turned out to be much simpler code.
I think I was reading the Write Yourself a Scheme tutorial when I
started writing that code, so naturally I started with Parsec.
As a side note, I was reading the I/O section of RWH last night and
came across the lazy vs. strict I/O part; however, it didn't occur to
me that Parsec was strict.
Anyway, thanks for all the help and suggestions.
--
Lucas Hazel

On Fri, Apr 03, 2009 at 10:27:08PM +1100, lucas@die.net.au wrote:
[SNIP]
I've decided to dump Parsec, as the file structure is simple enough to handle with basic list manipulation.
[SNIP]
Using lazy I/O has reduced run time by 75% and RAM consumption to 3MB.
Thank you :)
--
Lucas Hazel
participants (7)
- Jason Dagit
- Ketil Malde
- lucas@die.net.au
- Luke Palmer
- Malcolm Wallace
- Rick R
- Sterling Clover