Thanks for the responses everyone, I'll try them out and see what happens :)
Andrew
Hi Andrew,
On Thu, Jun 7, 2012 at 5:39 PM, Andrew Myers <asm198@gmail.com> wrote:
> Hi Cafe,
> I'm working on inspecting some data that I'm trying to represent as records
> in Haskell and seeing about twice the memory footprint I was
> expecting. I've got roughly 1.4 million records in a CSV file (400M on
> disk) that I parse in using bytestring-csv. bytestring-csv returns a
> [[ByteString]] (wrapped in `type`s) which I then convert into a list of
> records that have the following structure:
>
>> 3 Int
>> 1 Text Length 3
>> 1 Text Length 11
>> 12 Float
>> 1 UTCTime
>
> All fields are marked strict and have {-# UNPACK #-} pragmas (I'm guessing
> that doesn't do anything for non-primitives). (Side note, is there a way to
> check if things are actually being unpacked?)
GHC used to complain when you use UNPACK with something that can't be
unpacked, but that warning seems to have been (accidentally) removed
in 7.4.1.
The rule for unpacking is:
* all product types (i.e. types with only one constructor) can be
unpacked. This includes Int, Char, Double, etc., and tuples or records
thereof.
* sum types (i.e. data types with more than one constructor) and
polymorphic fields can't be unpacked.
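For example (a sketch, not your actual record type), UNPACK takes
effect on the Int and Float fields below but is ignored on the Maybe
field, because Maybe is a sum type:

data Record = Record
    { recId     :: {-# UNPACK #-} !Int    -- product type: gets unpacked
    , recWeight :: {-# UNPACK #-} !Float  -- product type: gets unpacked
    , recNote   :: !(Maybe Int)           -- sum type: pragma has no effect
    }

-- One way to check what actually got unpacked is to compile with
--   ghc -O2 -ddump-simpl Record.hs
-- and look at the constructor's worker: unpacked fields show up as
-- unboxed types like Int# and Float# instead of pointers. (UNPACK
-- only takes effect when compiling with optimisation.)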
> My back of the napkin memory estimates based on the assumption that nothing
> is being unpacked (and my very spotty understanding of Haskell data
> structures):
>
> Platform: 64 Bit Linux
> # Type (Sizeof type (occasionally a guess))
>
> 3 * Int (8)
> 14 * Char (4) -- Text is some kind of bytestring but I'm guessing it can't
> be worse than the same number of Char?
> 12 * Float (4)
> 18 * sizeOf (ptr) (8)
> UTC: -- From what I can gather through :info in ghci
> 4 * (ptr) (8)
> 2 * Integer (16) -- Shouldn't be overly large, times are within 2012
All fields in a constructor are word aligned. This means that all
primitive types take 8 bytes on a 64-bit platform, including Char and
Float. You might find the following blog posts by me useful in
computing the size of data structures:
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html
http://blog.johantibell.com/2011/11/slides-from-my-guest-lecture-at.html
Here's some more on the topic:
http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-types
http://stackoverflow.com/questions/6574444/how-to-find-out-ghcs-memory-representations-of-data-types
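To make the arithmetic concrete, here is a rough sketch of the
per-record cost for the shape you describe, assuming the Ints and
Floats do get unpacked and the two Text fields and the UTCTime stay
behind pointers (just the 8-bytes-per-field rule from above):

-- Back-of-the-envelope estimate, 64-bit, sizes in bytes.
wordSize :: Int
wordSize = 8

recordShellBytes :: Int
recordShellBytes =
    wordSize          -- constructor header
  + 3  * wordSize     -- 3 unpacked Int fields
  + 12 * wordSize     -- 12 unpacked Float fields (still one word each)
  + 2  * wordSize     -- 2 pointers to the Text values
  + 1  * wordSize     -- 1 pointer to the UTCTime value
                      -- = 19 words = 152 bytes per record shell

main :: IO ()
main = print recordShellBytes  -- 152

At 1.4 million records that is already around 200 MB before counting
the Text and UTCTime payloads, the list cells (3 words each), and any
unevaluated thunks, so a multi-gigabyte heap is not that surprising.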
> I've written a small driver test program that just parses the CSV, finds the
> minimum value for a couple of the Float fields, and exits. In the process
> monitor, the memory usage is 6.9G before the program exits. I've tried
> profiling with +RTS -hc, but it ran for >3 hours without finishing; it
> normally finishes within 4 minutes. Anyone have any ideas for me? Things
> to try?
> Thanks,
> Andrew
You could try to use a 32-bit GHC, which would use about half the
memory. You're at the limit of the size of data that you can
comfortably fit in memory on a normal desktop machine, so it might be
time to consider a streaming approach.
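If you do try the streaming route, here is a minimal sketch of the
idea, with the caveats that it assumes no quoted/escaped commas (so it
skips bytestring-csv for this pass), that the Float you want is in,
say, column 5, and that "data.csv" is a placeholder path:

import qualified Data.ByteString.Lazy.Char8 as L
import Data.List (foldl')

-- Fold over the file line by line, keeping only a running minimum,
-- so the full record list never has to live in memory at once.
minOfColumn :: Int -> FilePath -> IO (Maybe Double)
minOfColumn col path = do
    contents <- L.readFile path
    let step acc line =
            case drop col (L.split ',' line) of
                (field:_) ->
                    case reads (L.unpack field) of
                        [(x, _)] -> case acc of
                                        Nothing -> Just x
                                        Just m  -> Just $! min x m
                        _        -> acc  -- header row or unparsable field
                _ -> acc
    return $! foldl' step Nothing (L.lines contents)

main :: IO ()
main = minOfColumn 5 "data.csv" >>= print

The strict fold plus ($!) keeps the accumulator evaluated, so the heap
stays at roughly one lazy chunk of the file at a time.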
-- Johan