Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

1 Mar 2009

Kenneth Hoste wrote:
> ...
> Hello,
> I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
> I kind of have an algorithm in mind that I want to implement using Haskell,
> but up until now, the main issue has been to find a way to efficiently
> represent the data...
> For people who are not familiar with the Netflix data, in short, it
> consists of roughly 100M (1e8) user ratings (1-5, integer) for 17,770
> different movies, coming from 480,109 different users.

Hi Kenneth.

I have written a simple program that parses the Netflix training data set,
using this data structure:

type MovieRatings = IntMap (UArr Word32, UArr Word8)

The ratings are grouped by movies.
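
To make the layout concrete, here is a minimal sketch of it, not the actual program. It assumes `UArray` from the array package as a stand-in for `UArr` from uvector, and the helper names `mkRatings`, `fromMovieList` and `payloadBytes` are made up for illustration.

```haskell
import qualified Data.IntMap.Strict as IntMap
import Data.Array.Unboxed (UArray, bounds, listArray)
import Data.Ix (rangeSize)
import Data.Word (Word32, Word8)

-- Assumption: UArray (array package) stands in for uvector's UArr.
-- Each movie ID maps to two parallel unboxed arrays:
-- user IDs (4 bytes each) and ratings (1 byte each).
type MovieRatings = IntMap.IntMap (UArray Int Word32, UArray Int Word8)

-- Build the two parallel arrays for one movie from (user ID, rating) pairs.
mkRatings :: [(Word32, Word8)] -> (UArray Int Word32, UArray Int Word8)
mkRatings rs =
  ( listArray (0, n - 1) (map fst rs)
  , listArray (0, n - 1) (map snd rs) )
  where
    n = length rs

-- Group a list of (movie ID, ratings) entries into the map.
fromMovieList :: [(Int, [(Word32, Word8)])] -> MovieRatings
fromMovieList = IntMap.fromList . map (fmap mkRatings)

-- Raw element bytes only: 4 per user ID plus 1 per rating.
-- Array headers and IntMap nodes are not counted.
payloadBytes :: MovieRatings -> Int
payloadBytes = IntMap.foldl' step 0
  where
    step acc (users, _) = acc + 5 * rangeSize (bounds users)
```

With this layout the per-rating cost is 5 bytes of payload; everything else is per-movie overhead, which is small with only 17,770 movies.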

The parsing is done in:
real	8m32.476s
user	3m5.276s
sys	0m8.681s

This is on a Dell Inspiron 6400 notebook with an
Intel Core2 T7200 @ 2.00GHz and 2 GB of memory.

However, the memory used is about 1.4 GB.
How did you manage to get memory usage down to 700 MB?

Note that the minimum space required is about 480 MB (assuming a 4-byte
integer for the ID and a 1-byte integer for the rating).
Using a 4-byte integer for both the ID and the rating, the space required
is about 765 MB.

1.5 GB is the space required if one uses a total of 16 bytes to store
both the ID and the rating.
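
The arithmetic behind these figures can be checked with a small sketch. It assumes the round count of 1e8 ratings from the quoted message and binary megabytes (1 MB = 1024 * 1024 bytes); the names `numRatings`, `totalBytes` and `toMiB` are made up for illustration.

```haskell
-- Assumption: the round figure of 1e8 ratings from the quoted message.
numRatings :: Integer
numRatings = 100 * 10 ^ (6 :: Int)

-- Total payload bytes when each rating (ID + rating value)
-- occupies the given number of bytes.
totalBytes :: Integer -> Integer
totalBytes bytesPerRating = numRatings * bytesPerRating

-- Convert bytes to binary megabytes (1024 * 1024 bytes).
toMiB :: Integer -> Double
toMiB b = fromIntegral b / (1024 * 1024)
```

Under those assumptions, `toMiB (totalBytes 5)`, `toMiB (totalBytes 8)` and `toMiB (totalBytes 16)` come out at roughly 477, 763 and 1526, which is in line with the 480 MB, 765 MB and 1.5 GB above.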

Maybe it is the garbage collector that does not release memory to the 
operating system?

Thanks

Manlio Perillo