High memory usage with 1.4 Million records?

Hi Cafe, I'm working on inspecting some data that I'm trying to represent as records in Haskell, and I'm seeing about twice the memory footprint I was expecting. I've got roughly 1.4 million records in a CSV file (400M on disk) that I parse using bytestring-csv. bytestring-csv returns a [[ByteString]] (wrapped in `type`s), which I then convert into a list of records with the following structure:
  3  Int
  1  Text (length 3)
  1  Text (length 11)
  12 Float
  1  UTCTime
All fields are marked strict and have {-# UNPACK #-} pragmas (I'm guessing that doesn't do anything for non-primitives). (Side note: is there a way to check whether things are actually being unpacked?)

My back-of-the-napkin memory estimates, based on the assumption that nothing is being unpacked (and my very spotty understanding of Haskell data structures):

Platform: 64-bit Linux
# * Type (sizeof type, occasionally a guess)
3  * Int (8)
14 * Char (4)        -- Text is some kind of bytestring, but I'm guessing it can't be worse than the same number of Char?
12 * Float (4)
18 * sizeof (ptr) (8)
UTC:                 -- from what I can gather through :info in ghci
4  * (ptr) (8)
2  * Integer (16)    -- shouldn't be overly large, times are within 2012

List (pointer to element and next cons cell):
1408113 * 8 * 2 ≈ 21.5M

So even if the original bytestring file is being kept entirely in memory somehow, that's not more than 3G. I've written a small driver test program that just parses the CSV, finds the minimum value for a couple of the Float fields, and exits. In the process monitor the memory usage is 6.9G before the program exits. I've tried profiling with +RTS -hc, but it ran for more than 3 hours without finishing; the program normally finishes within 4 minutes. Anyone have any ideas for me? Things to try?

Thanks,
Andrew
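A minimal sketch of the record shape being described, with invented field names and only three of the twelve Float columns shown, might look like this (an illustration under assumptions, not code from the thread):

```haskell
-- Hypothetical Record.hs: field names are invented for illustration.
-- UNPACK only takes effect on single-constructor field types such as Int
-- and Float (and only when compiling with -O); the Text and UTCTime
-- fields remain pointers to separately allocated heap objects.
module Record where

import Data.Text (Text)
import Data.Time.Clock (UTCTime)

data Record = Record
  { recId    :: {-# UNPACK #-} !Int
  , recCount :: {-# UNPACK #-} !Int
  , recGroup :: {-# UNPACK #-} !Int
  , recCode  :: !Text                  -- 3 characters
  , recLabel :: !Text                  -- 11 characters
  , recV1    :: {-# UNPACK #-} !Float
  , recV2    :: {-# UNPACK #-} !Float
  , recV3    :: {-# UNPACK #-} !Float
    -- ...nine more Float fields in the real type...
  , recTime  :: !UTCTime
  }
```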

* Andrew Myers wrote:
I've written a small driver test program that just parses the CSV, finds the minimum value for a couple of the Float fields, and exits. In the process monitor the memory usage is 6.9G before the program exits. I've tried profiling with +RTS -hc, but it ran for more than 3 hours without finishing; it normally finishes within 4 minutes. Anyone have any ideas for me? Things to try?
It's possible that you have problems with laziness (memory being occupied by thunks). To quickly confirm this, try executing a non-profiled build with +RTS -hT. If it takes too long to complete, just abort it in the middle -- the profiling data will be written anyway (unless you kill it in too violent a way).

-- Roman I. Cheplyaka :: http://ro-che.info/
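To make the laziness suspicion concrete, here is a tiny, hedged illustration (not code from the thread): a lazy left fold over about 1.4 million floats retains a chain of min thunks, while the strict variant keeps only one value alive. Compiling with -rtsopts and running with +RTS -hT makes the difference visible in the heap profile.

```haskell
-- ThunkDemo.hs: minimal demonstration of thunk build-up vs. a strict fold.
-- Build: ghc -O0 -rtsopts ThunkDemo.hs
-- Run:   ./ThunkDemo +RTS -hT    (then render the .hp file with hp2ps)
module Main (main) where

import Data.List (foldl')

lazyMin, strictMin :: [Float] -> Float
lazyMin   = foldl  min (1 / 0)   -- builds ~1.4 million nested 'min' thunks
strictMin = foldl' min (1 / 0)   -- forces each step, keeps one Float alive

main :: IO ()
main = print (strictMin [1 .. 1400000])
  -- swap in lazyMin to watch the heap profile grow (it may even blow the stack)
```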

On 8 June 2012 01:39, Andrew Myers wrote:
Hi Cafe, I'm working on inspecting some data that I'm trying to represent as records in Haskell and seeing about twice the memory footprint I was expecting.
That is to be expected in a garbage-collected language. If your program requires X bytes of memory, then the allocator will usually trigger a garbage collection once the heap reaches a size of 2X bytes. If it didn't do this, then every allocation would require a GC. You can change this factor with the +RTS -F option; e.g., +RTS -F1.5 should reduce the overhead to only 50%, but will trigger more frequent garbage collections. To find the actual residency (live data), see the output of +RTS -s.

There may still be room for improvement. For example, you could try turning on the compacting GC, which trades GC performance for lower memory usage. You can enable it with +RTS -c.

The reason that -hc runs slowly is that it performs a GC every 1s (I think). You can change this using the -i option; e.g., -i60 only examines the heap every 60s. It will touch almost all your live data, so it is an inherently RAM-speed-bound operation.

HTH,
/ Thomas
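A minimal way to experiment with these flags is to compile with -rtsopts and pass them at run time; the module name and commands below are placeholders for the real CSV driver, not taken from the thread:

```haskell
-- RtsDemo.hs: trivial stand-in for the real CSV driver, just to show the flags.
-- Build: ghc -O2 -rtsopts RtsDemo.hs
-- Run:   ./RtsDemo +RTS -s         -- prints max residency (live data) and GC stats
--        ./RtsDemo +RTS -s -F1.5   -- grow the heap by 1.5x instead of the default 2x
--        ./RtsDemo +RTS -s -c      -- compacting collection for the oldest generation
module Main (main) where

main :: IO ()
main = putStrLn "replace this with the real CSV processing"
```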

Hi Andrew,
On Thu, Jun 7, 2012 at 5:39 PM, Andrew Myers wrote:
Hi Cafe, I'm working on inspecting some data that I'm trying to represent as records in Haskell and seeing about twice the memory footprint I was expecting. I've got roughly 1.4 million records in a CSV file (400M on disk) that I parse using bytestring-csv. bytestring-csv returns a [[ByteString]] (wrapped in `type`s), which I then convert into a list of records with the following structure:

  3  Int
  1  Text (length 3)
  1  Text (length 11)
  12 Float
  1  UTCTime

All fields are marked strict and have {-# UNPACK #-} pragmas (I'm guessing that doesn't do anything for non-primitives). (Side note: is there a way to check whether things are actually being unpacked?)
GHC used to complain when you use UNPACK with something that can't be unpacked, but that warning seems to have been (accidentally) removed in 7.4.1. The rule for unpacking is:

* All product types (i.e. types with only one constructor) can be unpacked. This includes Int, Char, Double, etc. and tuples or records thereof.
* Sum types (i.e. data types with more than one constructor) and polymorphic fields can't be unpacked.
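A small, hedged illustration of that rule (not code from the thread): the product-typed field gets flattened into the enclosing constructor, while the sum-typed field keeps its pragma ignored. Looking at the Core output (ghc -O -ddump-simpl) is one way to check whether a field really ended up unboxed.

```haskell
-- Illustration only: Point is a single-constructor type, so it can be
-- unpacked into T; Maybe Int is a sum type, so its UNPACK pragma is
-- ignored (and, in GHC versions where the warning is on, reported).
module Unpack where

data Point = Point {-# UNPACK #-} !Double {-# UNPACK #-} !Double

data T = T
  { tPoint :: {-# UNPACK #-} !Point       -- flattened into two unboxed Double# fields
  , tExtra :: {-# UNPACK #-} !(Maybe Int) -- stays a pointer; the pragma has no effect
  }
```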
My back-of-the-napkin memory estimates, based on the assumption that nothing is being unpacked (and my very spotty understanding of Haskell data structures):

Platform: 64-bit Linux
# * Type (sizeof type, occasionally a guess)
3  * Int (8)
14 * Char (4)        -- Text is some kind of bytestring, but I'm guessing it can't be worse than the same number of Char?
12 * Float (4)
18 * sizeof (ptr) (8)
UTC:                 -- from what I can gather through :info in ghci
4  * (ptr) (8)
2  * Integer (16)    -- shouldn't be overly large, times are within 2012
All fields in a constructor are word aligned. This means that all primitive types take 8 bytes on a 64-bit platform, including Char and Float. You might find the following blog posts by me useful in computing the size of data structures:

http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.ht...
http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html
http://blog.johantibell.com/2011/11/slides-from-my-guest-lecture-at.html

Here's some more on the topic:

http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-...
http://stackoverflow.com/questions/6574444/how-to-find-out-ghcs-memory-repre...
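For a concrete (and hedged) back-of-the-envelope number using the model from those posts -- one word for each constructor header, one word per field, two words for each boxed Int or Float, and ignoring the Text, ByteString and UTCTime payloads entirely -- a rough calculation might look like this:

```haskell
-- SizeEstimate.hs: rough estimate in the header-plus-fields model.
-- Field counts come from the record layout described earlier in the
-- thread; the model and everything it omits are approximations.
module SizeEstimate where

wordSize, records :: Int
wordSize = 8          -- bytes per word on a 64-bit platform
records  = 1408113

bytesPerRecord :: Int
bytesPerRecord =
      (1 + 18) * wordSize   -- record constructor: header + 18 pointer fields
    + 15 * 2 * wordSize     -- 3 boxed Int + 12 boxed Float, 2 words each

consCell :: Int
consCell = 3 * wordSize     -- (:) cell: header + head pointer + tail pointer

totalBytes :: Int
totalBytes = records * (bytesPerRecord + consCell)
-- = 1408113 * (392 + 24) bytes, roughly 560 MiB, before the Text and
-- UTCTime payloads, GC headroom, and whatever the parser still retains.
```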
I've written a small driver test program that just parses the CSV, finds the minimum value for a couple of the Float fields, and exits. In the process monitor the memory usage is 6.9G before the program exits. I've tried profiling with +RTS -hc, but it ran for more than 3 hours without finishing; it normally finishes within 4 minutes. Anyone have any ideas for me? Things to try? Thanks, Andrew
You could try to use a 32-bit GHC, which would use about half the memory. You're at the limit of the size of data that you can comfortably fit in memory on a normal desktop machine, so it might be time to consider a streaming approach.

-- Johan
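A rough sketch of what such a streaming pass could look like, using lazy I/O and a strict fold so that only the running minimum is retained; the file name, column index, and the naive comma splitting (no quoting support) are invented for illustration, not taken from the thread:

```haskell
{-# LANGUAGE BangPatterns #-}
-- StreamMin.hs: stream the CSV line by line instead of building a record list.
module Main (main) where

import qualified Data.ByteString.Lazy.Char8 as L
import           Data.List (foldl')
import           Text.Read (readMaybe)

-- Minimum of one (0-based) column, ignoring lines where it doesn't parse.
minOfColumn :: Int -> L.ByteString -> Maybe Float
minOfColumn col = foldl' step Nothing . L.lines
  where
    step !acc line =
      case drop col (L.split ',' line) of
        (field:_) | Just x <- readMaybe (L.unpack field)
                  -> Just $! maybe x (min x) acc
        _         -> acc

main :: IO ()
main = do
  contents <- L.readFile "records.csv"   -- placeholder file name
  print (minOfColumn 5 contents)         -- placeholder column index
```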

Thanks for the responses everyone, I'll try them out and see what happens :)
Andrew

On Fri, Jun 8, 2012 at 1:40 PM, Johan Tibell wrote:
GHC used to complain when you use UNPACK with something that can't be unpacked, but that warning seems to have been (accidentally) removed in 7.4.1.
Turns out the warning is only on if you compile with -O or higher.

-- Johan
participants (4):
- Andrew Myers
- Johan Tibell
- Roman Cheplyaka
- Thomas Schilling