UArray Word16 Word32 uses twice as much memory as it should?

Hello,

I am having an issue with unboxed arrays. I have some code that creates this structure:

    (Array Word16 (UArray Int Word32), Array Word16 (UArray Int Word8))

and I am finding that it uses about twice as much memory as I had anticipated. The tuple is returned strictly, and I don't think I have left much room for other data to remain in memory. It should hold one Word8 and one Word32 for each record of a data set of 100 million records, and it uses around 1 gigabyte. By my calculations, it should be half that. So I was wondering whether I might have hit a 64-bit vs 32-bit issue.

I compile with:

    ghc --make load05 -O1 -funbox-strict-fields -XBangPatterns -fvia-C

(also tried with -O2, and without -fvia-C) using the packaged version of GHC 6.10.1 on Mac OS X 10.5.5.

Grateful for any pointers,
Arne D Halvorsen
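The "half that" estimate can be checked with a few lines of arithmetic. This is only a sketch of the calculation described in the post, assuming each UArray element is stored at its natural width (4 bytes for Word32, 1 byte for Word8) and ignoring array headers and garbage-collector overhead. The names recordCount, bytesPerRecord, expectedPayload and demo are made up for the illustration.

```haskell
import Data.Word (Word8, Word32)
import Foreign.Storable (sizeOf)

-- Number of records mentioned in the post.
recordCount :: Integer
recordCount = 100 * 1000 * 1000

-- Payload bytes per record: one Word32 plus one Word8.
bytesPerRecord :: Integer
bytesPerRecord = fromIntegral (sizeOf (0 :: Word32) + sizeOf (0 :: Word8))

-- Expected payload in bytes, ignoring array headers and GC overhead.
expectedPayload :: Integer
expectedPayload = recordCount * bytesPerRecord

-- Print the estimate in (decimal) megabytes.
demo :: IO ()
demo = putStrLn ("expected payload: "
                 ++ show (expectedPayload `div` (1000 * 1000)) ++ " MB")
```

Under these assumptions the payload comes to about 500 MB, which matches the figure the poster expected rather than the roughly 1 gigabyte observed.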

Hello Arne,

Wednesday, November 19, 2008, 11:57:01 AM, you wrote:

> finding that it uses about twice as much memory as I had anticipated.

It may be:
1) a GC problem (because of garbage collection, Haskell programs occupy 2-3x more memory than they actually use)
2) additional data (you did not say how long each small array is; you should expect 10-30 additional bytes for every array)

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com
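The two candidate causes can be put side by side with rough numbers. The following is a back-of-the-envelope sketch, not a measurement: it takes the upper end of the 10-30 byte estimate, assumes two unboxed arrays per film for the 17770 films mentioned later in the thread, and treats the "2-3x" GC remark as a plain multiplier. All names (payloadBytes, headerBytes, withGcFactor, demo) are hypothetical.

```haskell
-- Payload: 100 million records of one Word32 (4 bytes) and one Word8 (1 byte).
payloadBytes :: Integer
payloadBytes = 100000000 * (4 + 1)

-- Upper end of the 10-30 byte per-array estimate from the reply.
overheadPerArray :: Integer
overheadPerArray = 30

-- Assumption: two unboxed arrays (ids and ratings) per film, with the
-- film count of 17770 that is stated later in the thread.
arrayCount :: Integer
arrayCount = 2 * 17770

-- Total bytes spent on per-array overhead under these assumptions.
headerBytes :: Integer
headerBytes = arrayCount * overheadPerArray

-- Hypothetical helper: scale the live data by an assumed GC factor.
withGcFactor :: Integer -> Integer -> Integer
withGcFactor factor live = factor * live

demo :: IO ()
demo = do
  putStrLn ("per-array overhead: " ++ show headerBytes ++ " bytes")
  putStrLn ("footprint at factor 2: "
            ++ show (withGcFactor 2 (payloadBytes + headerBytes)) ++ " bytes")
```

With these figures the per-array overhead is on the order of 1 MB against a 500 MB payload, so under these assumptions it cannot account for a doubling, whereas a GC factor of 2 would land close to the observed 1 gigabyte.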

Bulat Ziganshin wrote:
> Hello Arne,
> Wednesday, November 19, 2008, 11:57:01 AM, you wrote:
>> finding that it uses about twice as much memory as I had anticipated.
Hello, and thank you for your reply.
> It may be 1) a GC problem (because of garbage collection, Haskell programs occupy 2-3x more memory than they actually use)
I wasn't aware of that - but it should be possible to trigger a GC after loading a whole lot of data?
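For reference, a collection can be requested explicitly with performGC from System.Mem. The sketch below only shows where such a call would go; loadRatings is a made-up stand-in for the real loader, and the array size is a toy value.

```haskell
import Data.Array.Unboxed (UArray, bounds, listArray)
import Data.Word (Word32)
import System.Mem (performGC)

-- Hypothetical stand-in for the real loader: builds one unboxed array
-- from a temporary list.
loadRatings :: Int -> UArray Int Word32
loadRatings n = listArray (0, n - 1) (map fromIntegral [0 .. n - 1])

demo :: IO ()
demo = do
  let arr = loadRatings 1000
  -- Force the array so that loading has actually happened before the GC.
  arr `seq` return ()
  -- Request a major collection now that the temporary list is garbage.
  performGC
  print (bounds arr)
```

Whether the process footprint reported by the operating system shrinks afterwards depends on the runtime system, so this is best checked with the RTS statistics (`+RTS -s`, spelled `-sstderr` on older releases) rather than assumed. The RTS also documents a compacting collector for the oldest generation (`+RTS -c`), which trades speed for a smaller peak.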
> 2) additional data (you did not say how long each small array is; you should expect 10-30 additional bytes for every array)
The arrays represent the Netflix data set: 100 000 000 ratings, given for 17770 films. For each of the films, I want to hold (on average, roughly) 2000 ratings, held as one person id (32-bit) and one rating (8-bit) in the respective arrays.

In addition, I want to be able to load the inversion of this data: for every person, I want to hold their ratings in a similar way, as a 16-bit film id and an 8-bit rating. There are 480000 persons, so this should be on average 200 entries per person.

I have coded a few approaches to inverting this, but I can't allocate the array before traversing the data, because I don't know the sizes. How can one go about inverting this data in memory? It seems that any kind of laziness will fill the whole memory before I have traversed the whole set, and if I use several accumArrays, it seems that the whole uncompacted data set will be held in memory between accumArrays.

Ideally I want to hold all ratings as well as statistics for all films, and the same for all the persons, and then have room to spare for running an algorithm...

Best regards,
Arne D Halvorsen
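One possible way around the unknown sizes is to traverse the data twice: a first pass that only counts ratings per person, and a second pass that writes each entry into a preallocated flat array. The following is a minimal sketch of that idea and not something proposed in the thread. It assumes person ids have already been mapped to a dense index 0..nPersons-1, and it takes the ratings as a plain list for brevity. The names Rating, personOffsets, newCursor, filmsByPerson, filmsOf and demo are hypothetical.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, newArray, readArray, runSTUArray, thaw, writeArray)
import Data.Array.Unboxed (UArray, bounds, elems, (!))
import Data.Word (Word16, Word8)

-- (film id, dense person index, rating); the dense index is an assumption.
type Rating = (Word16, Int, Word8)

-- Pass 1: count ratings per person, then turn the counts into start
-- offsets. Entry i is the first slot of person i; the last entry is the total.
personOffsets :: Int -> [Rating] -> UArray Int Int
personOffsets nPersons rs = runSTUArray (do
  counts <- newArray (0, nPersons) 0
  forM_ rs $ \(_, p, _) -> do
    c <- readArray counts (p + 1)
    writeArray counts (p + 1) (c + 1)
  -- Prefix sum over the counts.
  forM_ [1 .. nPersons] $ \i -> do
    prev <- readArray counts (i - 1)
    cur <- readArray counts i
    writeArray counts i (prev + cur)
  return counts)

-- Mutable copy of the offsets, used as a per-person write cursor.
newCursor :: UArray Int Int -> ST s (STUArray s Int Int)
newCursor = thaw

-- Pass 2: write each film id into the next free slot of its person.
filmsByPerson :: UArray Int Int -> [Rating] -> UArray Int Word16
filmsByPerson offs rs = runSTUArray (do
  let total = offs ! snd (bounds offs)
  cursor <- newCursor offs
  out <- newArray (0, total - 1) 0
  forM_ rs $ \(film, p, _) -> do
    slot <- readArray cursor p
    writeArray out slot film
    writeArray cursor p (slot + 1)
  return out)

-- Film ids rated by person p, read back from the flat layout.
filmsOf :: UArray Int Int -> UArray Int Word16 -> Int -> [Word16]
filmsOf offs films p = [films ! i | i <- [offs ! p .. offs ! (p + 1) - 1]]

demo :: IO ()
demo = do
  let rs = [(1, 0, 5), (1, 2, 3), (2, 0, 4), (3, 1, 1)] :: [Rating]
      offs = personOffsets 3 rs
      films = filmsByPerson offs rs
  print (elems offs)
  print (filmsOf offs films 0)
```

The 8-bit ratings could be scattered the same way into a second flat array of Word8. With the real data the list would have to be regenerated from the film-indexed arrays on each pass rather than shared, since a shared list of 100 million tuples would itself stay in memory between the passes.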
participants (2)
- Arne Dehli Halvorsen
- Bulat Ziganshin