memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

Hello,

I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave. I kind of have an algorithm in mind that I want to implement using Haskell, but up until now, the main issue has been to find a way to efficiently represent the data...

For people who are not familiar with the Netflix data, in short, it consists of roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different movies, coming from 480,189 different users. The idea I have in mind requires fast access to all the ratings for a particular user, so I was thinking about an IntMap keyed on user IDs, which maps to movie ID/rating pairs somehow. To see if I could efficiently represent the data set in this way, I wrote a small Haskell program (attached) which uses the following data type:

type Ratings = IntMap (UArray Int Word8) -- map of user IDs to ratings (lacks movie IDs)

For convenience, and because I've been discussing this with various people in #haskell on IRC, the code is also available here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=1671#a1671 .

I'm aware that the way in which the UArrays are constructed (i.e. by adding a single element each time) is far from efficient performance-wise, but that's not the point here... By reformatting the raw data, I can easily read it in more efficiently. The issue I ran into is that changing the data type to the following doesn't yield any significant difference in memory usage:

type Ratings = IntMap (UArray Int Int) -- use Int instead of Word8 for a user rating

Someone (dolio) in #haskell suggested that maybe UArray is not byte-packed for Word8, which would explain why there is little difference from a UArray containing Ints, but someone else (dons, in #ghc) was able to tell me it _is_ byte-packed. Does anyone know why the Word8 version is not significantly better in terms of memory usage?

greetings,

Kenneth

PS: My adventures in trying to tackle the Netflix Prize problem with Haskell can be followed at http://boegel.kejo.be.

--
Kenneth Hoste
ELIS - Ghent University
email: kenneth.hoste@elis.ugent.be
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste
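For concreteness, a minimal sketch of that layout (the type name Ratings matches the snippet above, but the helper fromUsers and its bounds handling are illustrative, not taken from the attached program):

    import Data.Array.Unboxed (UArray, listArray)
    import Data.IntMap (IntMap)
    import qualified Data.IntMap as IM
    import Data.Word (Word8)

    -- user ID -> ratings for that user (movie IDs omitted, as in the post)
    type Ratings = IntMap (UArray Int Word8)

    -- build the map from per-user rating lists; listArray needs the
    -- index bounds up front, so each array is allocated exactly once
    fromUsers :: [(Int, [Word8])] -> Ratings
    fromUsers xs =
      IM.fromList [ (uid, listArray (0, length rs - 1) rs) | (uid, rs) <- xs ]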

2009/2/23 Kenneth Hoste
Does anyone know why the Word8 version is not significantly better in terms of memory usage?
Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/BUArr.hs. How's that for service? :-)

bos:
2009/2/23 Kenneth Hoste
Does anyone know why the Word8 version is not significantly better in terms of memory usage?
Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/BUArr.hs.
How's that for service? :-)
UArray or UArr?

On Feb 23, 2009, at 19:57, Don Stewart wrote:
bos:
2009/2/23 Kenneth Hoste
Does anyone know why the Word8 version is not significantly better in terms of memory usage?
Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/BUArr.hs.
How's that for service? :-)
UArray or UArr?
Well, I'm using UArray, but I'm willing to consider other suitable containers, as long as they are memory efficient. :-)

The typical usage of a UArray will be getting all its contents and converting them to a list to easily manipulate (filter, ...). So maybe another data type would allow me to store the data in a limited amount of memory (which is my main concern now)...

K.

--
Kenneth Hoste
ELIS - Ghent University
email: kenneth.hoste@elis.ugent.be
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste
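For example, the access pattern described above would look something like this (a sketch; fiveStars is an illustrative name, not from the actual code):

    import Data.Array.Unboxed (UArray, elems)
    import Data.Word (Word8)

    -- dump the array's contents to a list and manipulate it from there
    fiveStars :: UArray Int Word8 -> Int
    fiveStars arr = length (filter (== 5) (elems arr))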

Kenneth Hoste wrote:
Well, I'm using UArray, but I'm willing to consider other suitable containers, as long as they are memory efficient. :-)
The typical usage of a UArray will be getting all its contents and converting them to a list to easily manipulate (filter, ...).
So maybe another data type would allow me to store the data in a limited amount of memory (which is my main concern now)...
Have you considered using spectral (or counting) Bloom filters? I know there's a non-spectral version available [1], though I haven't looked at it to see how easily it could be made to count.

[1] http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter

--
Live well,
~wren
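A minimal membership sketch with the non-spectral package (assuming its Data.BloomFilter.Easy API; show-ing the (user, movie) pairs to get a Hashable key is a shortcut of this sketch, and note a plain Bloom filter can only answer yes/no, not count):

    import qualified Data.BloomFilter.Easy as BF

    main :: IO ()
    main = do
      -- track which (user, movie) pairs have been rated, ~1% false positives
      let seen  = map show ([(6, 30), (6, 157), (8, 30)] :: [(Int, Int)])
          bloom = BF.easyList 0.01 seen
      -- an inserted pair: always True (Bloom filters have no false negatives)
      print (BF.elem (show ((6, 30) :: (Int, Int))) bloom)
      -- an absent pair: False with high probability (false positives possible)
      print (BF.elem (show ((8, 157) :: (Int, Int))) bloom)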

Kenneth Hoste wrote:
Hello,
I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
[...] To see if I could efficiently represent the data set in this way, I wrote a small Haskell program (attached) which uses the following data type:
From what I see, to append a new integer to the array, you convert the array to a list, append the new element to the list, and then convert it back to an array. Isn't this a bit inefficient?

The uvector package implements a vector of unboxed types and has a snocU operation for appending an element to the array. I don't know how efficient it is, however. (A sketch of snocU-based appending follows below.)

By the way, about uvector: it has a Stream data type, and you can build a vector from a stream. But how does this work, and how (if at all) is the stream machinery integrated with other packages? The package documentation still seems incomplete.
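To illustrate snocU-based appending (an untested sketch, assuming uvector exports emptyU, snocU, and fromU from Data.Array.Vector):

    import Data.Array.Vector (UArr, emptyU, snocU, fromU)
    import Data.Word (Word8)

    -- append ratings one at a time; each snocU copies the whole array,
    -- so building an n-element vector this way costs O(n^2) overall
    appendAll :: [Word8] -> UArr Word8
    appendAll = foldl snocU emptyU

    main :: IO ()
    main = print (fromU (appendAll [5, 3, 4]))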
[...]
Regards,
Manlio Perillo

On Feb 26, 2009, at 13:00, Manlio Perillo wrote:
Kenneth Hoste wrote:
Hello, I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave. [...] To see if I could efficiently represent the data set in this way, I wrote a small Haskell program (attached) which uses the following data type:
From what I see, to append a new integer to the array, you convert the array to a list, append the new element to the list, and then convert it back to an array.
Isn't this a bit inefficient?
Yes, performance-wise this is terribly inefficient, I agree. But it was just an artefact of how the raw data is organized; my main concern was the memory usage of the huge IntMap with UArray elements. Once I had solved that, I would be able to get around the performance issue by reorganizing the raw data.

However, as I posted yesterday, I've been able to circumvent the issue by rethinking my data type: using the ~18K movie IDs as keys instead of the 480K user IDs, which radically limits the per-key overhead (sketched below)... That way, I'm able to fit the data set in <700M of memory, without having to reorganize the raw data.
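Roughly, the reworked layout looks like this (a sketch with illustrative names, not the actual code):

    import Data.Array.Unboxed (UArray, listArray)
    import Data.IntMap (IntMap)
    import qualified Data.IntMap as IM
    import Data.Word (Word8, Word32)

    -- keyed on the ~18K movie IDs; each movie carries parallel arrays of
    -- user IDs and ratings, so the per-key IntMap overhead is paid ~18K
    -- times instead of ~480K times
    type ByMovie = IntMap (UArray Int Word32, UArray Int Word8)

    insertMovie :: Int -> [(Word32, Word8)] -> ByMovie -> ByMovie
    insertMovie movieId prs m =
      let n       = length prs
          users   = listArray (0, n - 1) (map fst prs)
          ratings = listArray (0, n - 1) (map snd prs)
      in  IM.insert movieId (users, ratings) m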
The uvector package implements a vector of unboxed types and has a snocU operation for appending an element to the array.
I don't know how efficient it is, however.
By the way, about uvector: it has a Stream data type, and you can build a vector from a stream.
Thanks for letting me know, I'll keep this in mind.

greetings,

Kenneth

--
Kenneth Hoste
Paris research group - ELIS - Ghent University, Belgium
email: kenneth.hoste@elis.ugent.be
website: http://www.elis.ugent.be/~kehoste
blog: http://boegel.kejo.be

Kenneth Hoste wrote:
[...] However, as I posted yesterday, I've been able to circumvent the issue by rethinking my data type: using the ~18K movie IDs as keys instead of the 480K user IDs, which radically limits the per-key overhead...
Well, but what if you really need the original, user-keyed structure for your data processing?
That way, I'm able to fit the data set in <700M of memory, without having to reorganize the raw data.
The uvector package implements a vector of unboxed types and has a snocU operation for appending an element to the array.
I don't know how efficient it is, however.
By the way, about uvector: it has a Stream data type, and you can build a vector from a stream.
Thanks for letting me know, I'll keep this in mind.
Let me know if there are performance improvements. Arrays are one of the few things I dislike in Haskell, and all the available array/vector packages cause me some confusion.

Regards,
Manlio

Kenneth Hoste wrote:
Hello,
I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
I kind of have an algorithm in mind that I want to implement using Haskell, but up until now, the main issue has been to find a way to efficiently represent the data...
For people who are not familiar with the Netflix data, in short, it consists of roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different movies, coming from 480,189 different users.
Hi Kenneth.

I have written a simple program that parses the Netflix training data set, using this data structure:

type MovieRatings = IntMap (UArr Word32, UArr Word8)

The ratings are grouped by movie. The parsing finishes in:

real    8m32.476s
user    3m5.276s
sys     0m8.681s

on a Dell Inspiron 6400 notebook with an Intel Core2 T7200 @ 2.00GHz and 2 GB of memory. However, the memory used is about 1.4 GB. How did you manage to get 700 MB of memory usage?

Note that the minimum space required is about 480 MB (assuming a 4-byte integer for the user ID and a 1-byte integer for the rating). Using a 4-byte integer for both ID and rating, the space required is about 765 MB; 1.5 GB is the space required if one uses a total of 16 bytes to store each ID/rating pair.

Maybe it is the garbage collector not releasing memory back to the operating system?

Thanks,
Manlio Perillo
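For reference, the layout and the arithmetic above as a sketch (the helper payloadBytes is illustrative, not part of the program described):

    import Data.Array.Vector (UArr)
    import Data.IntMap (IntMap)
    import Data.Word (Word8, Word32)

    -- ratings grouped by movie: parallel arrays of user IDs and ratings
    type MovieRatings = IntMap (UArr Word32, UArr Word8)

    -- back-of-the-envelope payload sizes for 1e8 ratings, excluding
    -- IntMap and array overhead:
    --   4-byte ID + 1-byte rating : 1e8 *  5 bytes ~ 480 MB
    --   4-byte ID + 4-byte rating : 1e8 *  8 bytes ~ 765 MB
    --   16 bytes per pair         : 1e8 * 16 bytes ~ 1.5 GB
    payloadBytes :: Integer -> Integer -> Integer
    payloadBytes idBytes ratingBytes = 100000000 * (idBytes + ratingBytes)

To test the garbage-collector hypothesis, running the program with +RTS -s prints GHC's maximum residency, which can be compared against what the OS reports; GHC runtimes of that era generally did not return freed heap to the operating system.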
participants (5):
- Bryan O'Sullivan
- Don Stewart
- Kenneth Hoste
- Manlio Perillo
- wren ng thornton