memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

Hello,

I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave. I kind of have an algorithm in mind that I want to implement using Haskell, but up until now, the main issue has been to find a way to efficiently represent the data...

For people who are not familiar with the Netflix data, in short, it consists of roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different movies, coming from 480,189 different users. The idea I have in mind requires fast access to all the ratings for a particular user, so I was thinking about an IntMap keyed on user IDs, which maps to movie ID/rating pairs somehow. To see if I could efficiently represent the data set in this way, I wrote a small Haskell program (attached) which uses the following data type:

type Ratings = IntMap (UArray Int Word8) -- map of user IDs to ratings (lacks movie IDs)

For convenience, and because I've been discussing this with various people in #haskell on IRC, the code is also available here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=1671#a1671 .

I'm aware that the way in which the UArrays are constructed (i.e. by adding a single element each time) is far from efficient performance-wise, but that's not the point here... By reformatting the raw data, I can easily read it in more efficiently. The issue I ran into is that changing the data type to the following doesn't yield any significant difference in memory usage:

type Ratings = IntMap (UArray Int Int) -- use Int instead of Word8 for a user rating

Someone (dolio) in #haskell suggested that maybe UArray is not byte-packed for Word8, which would explain why there is little difference from a UArray containing Ints, but someone else (dons, in #ghc) was able to tell me it _is_ byte-packed. Does anyone know why the Word8 version is not significantly better in terms of memory usage?

greetings,

Kenneth

PS: My adventures in trying to tackle the Netflix Prize problem with Haskell can be followed at http://boegel.kejo.be.

--
Kenneth Hoste
ELIS - Ghent University
email: kenneth.hoste@elis.ugent.be
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste
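For concreteness, a minimal sketch of that layout (the type name Ratings matches the snippet above, but the helper fromUsers and its bounds handling are illustrative, not taken from the attached program):

    import Data.Array.Unboxed (UArray, listArray)
    import Data.IntMap (IntMap)
    import qualified Data.IntMap as IM
    import Data.Word (Word8)

    -- user ID -> ratings for that user (movie IDs omitted, as in the post)
    type Ratings = IntMap (UArray Int Word8)

    -- build the map from per-user rating lists; listArray needs the
    -- index bounds up front, so each array is allocated exactly once
    fromUsers :: [(Int, [Word8])] -> Ratings
    fromUsers xs =
      IM.fromList [ (uid, listArray (0, length rs - 1) rs) | (uid, rs) <- xs ]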

2009/2/23 Kenneth Hoste
Does anyone know why the Word8 version is not significantly better in terms of memory usage?
Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/BUArr.hs. How's that for service? :-)

bos:
2009/2/23 Kenneth Hoste
Does anyone know why the Word8 version is not significantly better in terms of memory usage?
Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/BUArr.hs.
How's that for service? :-)
UArray or UArr?

On Feb 23, 2009, at 19:57, Don Stewart wrote:
bos:
2009/2/23 Kenneth Hoste
Does anyone know why the Word8 version is not significantly better in terms of memory usage?
Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/BUArr.hs.
How's that for service? :-)
UArray or UArr?
Well, I'm using UArray, but I'm willing to consider other suitable containers, as long as they are memory efficient. :-)

The typical usage of a UArray will be getting all its contents and converting them to a list to easily manipulate (filter, ...). So maybe another data type would allow me to store the data in a limited amount of memory (which is my main concern now)...

K.

--
Kenneth Hoste
ELIS - Ghent University
email: kenneth.hoste@elis.ugent.be
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste
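For example, the access pattern described above would look something like this (a sketch; fiveStars is an illustrative name, not from the actual code):

    import Data.Array.Unboxed (UArray, elems)
    import Data.Word (Word8)

    -- dump the array's contents to a list and manipulate it from there
    fiveStars :: UArray Int Word8 -> Int
    fiveStars arr = length (filter (== 5) (elems arr))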

Kenneth Hoste wrote:
Well, I'm using UArray, but I'm willing to consider other suitable containers, as long as they are memory efficient. :-)
The typical usage of a UArray will be getting all its contents and converting them to a list to easily manipulate (filter, ...).
So maybe another data type would allow me to store the data in a limited amount of memory (which is my main concern now)...
Have you considered using spectral (or counting) Bloom filters? I know there's a non-spectral version available [1], though I haven't looked at it to see how easily it could be made to count.

[1] http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter

--
Live well,
~wren
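A minimal membership sketch with the non-spectral package (assuming its Data.BloomFilter.Easy API; show-ing the (user, movie) pairs to get a Hashable key is a shortcut of this sketch, and note a plain Bloom filter can only answer yes/no, not count):

    import qualified Data.BloomFilter.Easy as BF

    main :: IO ()
    main = do
      -- track which (user, movie) pairs have been rated, ~1% false positives
      let seen  = map show ([(6, 30), (6, 157), (8, 30)] :: [(Int, Int)])
          bloom = BF.easyList 0.01 seen
      -- an inserted pair: always True (Bloom filters have no false negatives)
      print (BF.elem (show ((6, 30) :: (Int, Int))) bloom)
      -- an absent pair: False with high probability (false positives possible)
      print (BF.elem (show ((8, 157) :: (Int, Int))) bloom)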

Kenneth Hoste wrote:
Hello,
I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
[...] To see if I could efficiently represent the data set in this way, I wrote a small Haskell program (attached) which uses the following data type:
From what I see, to append a new integer to the array, you convert the array to a list, append the new element to the list, and then convert it back to an array. Isn't this a bit inefficient?

The uvector package implements a vector of unboxed types and has a snocU operation for appending an element to the array. I don't know how efficient it is, however. (A sketch of snocU-based appending follows below.)

By the way, about uvector: it has a Stream data type, and you can build a vector from a stream. But how does this work, and how (if at all) is the stream machinery integrated with other packages? The package documentation still seems incomplete.
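To illustrate snocU-based appending (an untested sketch, assuming uvector exports emptyU, snocU, and fromU from Data.Array.Vector):

    import Data.Array.Vector (UArr, emptyU, snocU, fromU)
    import Data.Word (Word8)

    -- append ratings one at a time; each snocU copies the whole array,
    -- so building an n-element vector this way costs O(n^2) overall
    appendAll :: [Word8] -> UArr Word8
    appendAll = foldl snocU emptyU

    main :: IO ()
    main = print (fromU (appendAll [5, 3, 4]))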
[...]
Regards,
Manlio Perillo

On Feb 26, 2009, at 13:00, Manlio Perillo wrote:
Kenneth Hoste wrote:
Hello, I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave. [...] To see if I could efficiently represent the data set in this way, I wrote a small Haskell program (attached) which uses the following data type:
From what I see, to append a new integer to the array, you convert the array to a list, append the new element to the list, and then convert it back to an array.
Isn't this a bit inefficient?
Yes, performance-wise this is terribly inefficient, I agree. But it was just an artefact of how the raw data is organized; my main concern was the memory usage of the huge IntMap with UArray elements. Once I had solved that, I would be able to get around the performance issue by reorganizing the raw data.

However, as I posted yesterday, I've been able to circumvent the issue by rethinking my data type: using the ~18K movie IDs as keys instead of the 480K user IDs, which radically limits the per-key overhead (sketched below)... That way, I'm able to fit the data set in <700M of memory, without having to reorganize the raw data.
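Roughly, the reworked layout looks like this (a sketch with illustrative names, not the actual code):

    import Data.Array.Unboxed (UArray, listArray)
    import Data.IntMap (IntMap)
    import qualified Data.IntMap as IM
    import Data.Word (Word8, Word32)

    -- keyed on the ~18K movie IDs; each movie carries parallel arrays of
    -- user IDs and ratings, so the per-key IntMap overhead is paid ~18K
    -- times instead of ~480K times
    type ByMovie = IntMap (UArray Int Word32, UArray Int Word8)

    insertMovie :: Int -> [(Word32, Word8)] -> ByMovie -> ByMovie
    insertMovie movieId prs m =
      let n       = length prs
          users   = listArray (0, n - 1) (map fst prs)
          ratings = listArray (0, n - 1) (map snd prs)
      in  IM.insert movieId (users, ratings) m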
The uvector package implements a vector of unboxed types and has a snocU operation for appending an element to the array.
I don't know how efficient it is, however.
By the way, about uvector: it has a Stream data type, and you can build a vector from a stream.
Thanks for letting me know, I'll keep this in mind.

greetings,

Kenneth

--
Kenneth Hoste
Paris research group - ELIS - Ghent University, Belgium
email: kenneth.hoste@elis.ugent.be
website: http://www.elis.ugent.be/~kehoste
blog: http://boegel.kejo.be

Kenneth Hoste wrote:
[...] However, as I posted yesterday, I've been able to circumvent the issue by rethinking my data type: using the ~18K movie IDs as keys instead of the 480K user IDs, which radically limits the per-key overhead...
Well, but what if you really need the original, user-keyed structure for your data processing?
That way, I'm able to fit the data set in <700M of memory, without having to reorganize the raw data.
The uvector package implements a vector of unboxed types and has a snocU operation for appending an element to the array.
I don't know how efficient it is, however.
By the way, about uvector: it has a Stream data type, and you can build a vector from a stream.
Thanks for letting me know, I'll keep this in mind.
Let me know if there are performance improvements. Arrays are one of the few things I dislike in Haskell, and all the available array/vector packages cause me some confusion.

Regards,
Manlio

Kenneth Hoste wrote:
Hello,
I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
I kind of have an algorithm in mind that I want to implement using Haskell, but up until now, the main issue has been to find a way to efficiently represent the data...
For people who are not familiar with the Netflix data, in short, it consists of roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different movies, coming from 480,189 different users.
Hi Kenneth.

I have written a simple program that parses the Netflix training data set, using this data structure:

type MovieRatings = IntMap (UArr Word32, UArr Word8)

The ratings are grouped by movie. The parsing finishes in:

real    8m32.476s
user    3m5.276s
sys     0m8.681s

on a Dell Inspiron 6400 notebook with an Intel Core2 T7200 @ 2.00GHz and 2 GB of memory. However, the memory used is about 1.4 GB. How did you manage to get 700 MB of memory usage?

Note that the minimum space required is about 480 MB (assuming a 4-byte integer for the user ID and a 1-byte integer for the rating). Using a 4-byte integer for both ID and rating, the space required is about 765 MB; 1.5 GB is the space required if one uses a total of 16 bytes to store each ID/rating pair.

Maybe it is the garbage collector not releasing memory back to the operating system?

Thanks,
Manlio Perillo
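For reference, the layout and the arithmetic above as a sketch (the helper payloadBytes is illustrative, not part of the program described):

    import Data.Array.Vector (UArr)
    import Data.IntMap (IntMap)
    import Data.Word (Word8, Word32)

    -- ratings grouped by movie: parallel arrays of user IDs and ratings
    type MovieRatings = IntMap (UArr Word32, UArr Word8)

    -- back-of-the-envelope payload sizes for 1e8 ratings, excluding
    -- IntMap and array overhead:
    --   4-byte ID + 1-byte rating : 1e8 *  5 bytes ~ 480 MB
    --   4-byte ID + 4-byte rating : 1e8 *  8 bytes ~ 765 MB
    --   16 bytes per pair         : 1e8 * 16 bytes ~ 1.5 GB
    payloadBytes :: Integer -> Integer -> Integer
    payloadBytes idBytes ratingBytes = 100000000 * (idBytes + ratingBytes)

To test the garbage-collector hypothesis, running the program with +RTS -s prints GHC's maximum residency, which can be compared against what the OS reports; GHC runtimes of that era generally did not return freed heap to the operating system.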
participants (5):
- Bryan O'Sullivan
- Don Stewart
- Kenneth Hoste
- Manlio Perillo
- wren ng thornton