help optimizing memory usage for a program

Hi.

After some work I have managed to implement two simple programs that parse the Netflix Prize data set. For details about the Netflix Prize, see the post by Kenneth Hoste from some time ago.

I have cabalized the program and made it available here:
http://haskell.mperillo.ath.cx/netflix-0.0.1.tar.gz

The process-data-1 program parses the training data set, grouping ratings by movie. The process-data-2 program, instead, groups ratings by user.

The structure of the two programs is similar: I create singleton IntMaps and then use foldl + union to merge them together. The data structure used is:

  type Rating       = Word32 :*: Word8   -- id, rating
  type MovieRatings = IntMap (UArr Rating)

UArr is from the uvector package.

The training data set contains about 100 million ratings, so, in theory, the amount of memory required to store the (id, rating) pairs is about 480 MB. The process-data-1 program parses the entire dataset using about 1.4 GB of memory (a 3x increase).

This is strange. The memory required should be proportional to the number of ratings. The culprit may be IntMap, or the garbage collector not releasing memory to the operating system (or, worse, not deallocating all the temporary memory it used).

process-data-2 has much bigger problems. When I try to parse the ratings of *only* 500 movies, memory usage is 471 MB, rising to 780 MB after all data has been fully evaluated (array concatenations).

I'm using GHC 6.8.2 on Debian Etch. I would like to run some tests with recent GHC versions, but that may be a problem.

Thanks
Manlio Perillo
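Below is a minimal sketch of the singleton + union approach described above, using plain lists in place of UArr and an ordinary tuple in place of the :*: strict pair; the parsing details and the file list are assumptions, not the program's actual code:

module Main where

import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.IntMap as IM
import Data.List (foldl')
import Data.Word (Word32, Word8)

type Rating       = (Word32, Word8)      -- (user id, rating); the real code uses Word32 :*: Word8
type MovieRatings = IM.IntMap [Rating]   -- keyed by movie id; the real code uses UArr Rating

-- Parse one movie file: the first line is "<movieId>:",
-- the remaining lines are "<userId>,<rating>,<date>".
parseMovie :: L.ByteString -> (Int, [Rating])
parseMovie s =
    case L.lines s of
      (hdr : rest) -> (readInt (L.takeWhile (/= ':') hdr), map parseLine rest)
      []           -> (0, [])
  where
    parseLine l = case L.split ',' l of
      (uid : rat : _) -> (fromIntegral (readInt uid), fromIntegral (readInt rat))
      _               -> (0, 0)
    readInt b = maybe 0 fst (L.readInt b)

-- Build a singleton IntMap per movie file, then merge them all with union.
buildMap :: [L.ByteString] -> MovieRatings
buildMap files =
    foldl' (IM.unionWith (++)) IM.empty
           [ IM.singleton movieId ratings
           | (movieId, ratings) <- map parseMovie files ]

main :: IO ()
main = do
    -- The file list is a placeholder; the real program walks training_set/.
    contents <- mapM L.readFile ["training_set/mv_0000001.txt"]
    print (IM.size (buildMap contents))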

Hello Manlio, Monday, March 2, 2009, 6:30:51 PM, you wrote:
The process-data-1 program parses the entire dataset using about 1.4 GB of memory (a 3x increase).
This is strange. The memory required should be proportional to the number of ratings. The culprit may be IntMap, or the garbage collector not releasing memory to the operating system (or, worse, not deallocating all the temporary memory it used).
GHC has 2 garbage collectors. The first is copying, used by default (because it's faster): when the previously allocated memory block overflows, it collects the live data by copying it into a *new* block allocated from the OS, and then uses the old block to satisfy further memory requests.

Let's calculate: if at the moment of a GC your program has allocated 100 MB of memory and only 50 MB of it is not garbage, then memory usage will be 150 MB.

The other GC is the compacting one. It's about 2x slower, but collects data in place.

Moreover, you may set up a "growing factor". With a g.f. of 1.5, for example, memory will be collected once the heap becomes 1.5x larger than the real memory usage after the last GC. This effectively guarantees that the memory overhead will never exceed this factor.

Look at the GHC manual, RTS switches section. And last, GHC never returns memory to the OS; there should be a ticket on this.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
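For reference, the switches Bulat mentions are ordinary GHC RTS flags given on the command line. An illustrative invocation (the program arguments and the flag values are only examples, not recommendations):

./process-data-1 training_set/ +RTS -c -F1.5 -A64m -s -RTS

Here -c selects the compacting collector, -F sets the heap growing factor, -A sets the allocation area size, and -s prints GC and memory statistics at exit.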

Bulat Ziganshin wrote:
Hello Manlio,
Monday, March 2, 2009, 6:30:51 PM, you wrote:
The process-data-1 program parses the entire dataset using about 1.4 GB of memory (a 3x increase).
This is strange. The memory required should be proportional to the number of ratings. The culprit may be IntMap, or the garbage collector not releasing memory to the operating system (or, worse, not deallocating all the temporary memory it used).
ghc has 2 garbage collectors.
[...]
I already tried enabling compacting collection (the -c RTS option). Here are some numbers (when parsing only 5000 movie files, with process-data-1):

1) With the default collection algorithm:
   real 1m4.599s  user 0m49.131s  sys 0m1.492s
   409 MB

2) With the -c option:
   real 1m45.197s  user 0m59.248s  sys 0m1.640s
   418 MB

So, nothing changed.
Moreover, you may set up a "growing factor". With a g.f. of 1.5, for example, memory will be collected once the heap becomes 1.5x larger than the real memory usage after the last GC. This effectively guarantees that the memory overhead will never exceed this factor.
Thanks. This seems to be effective (but it also reduces performance).

3) With the -F1 option:
   real 9m59.789s  user 9m41.844s  sys 0m5.608s
   264 MB

I have to parse the whole data set, to check whether memory usage is good.
Look at the GHC manual, RTS switches section. And last, GHC never returns memory to the OS; there should be a ticket on this.
The same problem exists with Python.

By the way: I have written the first version of the program to parse the Netflix training data set in D. I also used ncpu * 1.5 threads, to parse files concurrently. However, execution was *really* slow, due to garbage collection. I also tried disabling garbage collection and manually running a collection cycle from time to time (every 200 files parsed), but the performance was the same.

Running the Haskell version with -F1 *seems* (I have to check) to be as slow as the D version. So, it seems that for this class of problems it is better to do manual memory management (and it is not even hard).

When I find some time I will try to write a C++ version (I also have a Python version, but it is very memory inefficient). At least I hope that I can serialize the parsed data into a binary file, and then read it back with optimized memory usage.

Thanks
Manlio Perillo

Hello Manlio, Monday, March 2, 2009, 8:16:10 PM, you wrote:
1) With default collection algorithm, I have:
2) With -c option:
So, nothing changed.
You should look into the +RTS -s stats. Those 409 vs 418 MB are just somewhat random values, since the GCs in those two invocations are not synchronized. You should run the program several times with various sized datasets.

Also take a look at the other RTS options; I think it would be better to fix the heap size, so that you get maximum performance for a given amount of memory used. Also try to make the -A option close to your L2 cache size.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
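An illustrative invocation along these lines (the sizes are guesses for a typical machine, not measured recommendations):

./process-data-1 training_set/ +RTS -s -A2m -H1g -RTS

-s prints the statistics needed to compare runs, -A2m puts the allocation area in the neighbourhood of an L2 cache, and -H suggests a fixed overall heap size.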

Hello Manlio, Monday, March 2, 2009, 8:16:10 PM, you wrote:
By the way: I have written the first version of the program to parse the Netflix training data set in D. I also used ncpu * 1.5 threads, to parse files concurrently.
However, execution was *really* slow, due to garbage collection. I also tried disabling garbage collection and manually running a collection cycle from time to time (every 200 files parsed), but the performance was the same.
Maybe it would be better to use something like MapReduce and split your job into 100-file parts, processed by ncpu concurrently executing scripts?

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Manlio,
Monday, March 2, 2009, 8:16:10 PM, you wrote:
By the way: I have written the first version of the program to parse the Netflix training data set in D. I also used ncpu * 1.5 threads, to parse files concurrently.
However, execution was *really* slow, due to garbage collection. I also tried disabling garbage collection and manually running a collection cycle from time to time (every 200 files parsed), but the performance was the same.
Maybe it would be better to use something like MapReduce and split your job into 100-file parts, processed by ncpu concurrently executing scripts?
For the process-data-1 program there is no real need, since it is already fast enough (8 minutes on my laptop), and memory usage is not a problem, unless it requires more than 2 GB.

For process-data-2 there is some code that is left unevaluated (the array concatenation). Simple file parsing is quite fast; most of the time is spent (IMHO) concatenating arrays in foldl' (unionWith concatU), in the main function. This should be possible to run in parallel, with MapReduce; I have to check.

But if parsing stays this slow, there are only two solutions:

1) Write the parser in C++, and then serialize the data in a compact binary format, to be read by Haskell [1]. But in this case there is no reason to write one piece of code in C++ and the other in Haskell ;-)

2) Reimplement process-data-2 so that, instead of grouping ratings by user in an IntMap, it accumulates ratings in separate files (one per user, for a total of 480189 files [2]). This avoids the need for array concatenation. Then parse the data again; this should be more memory/GC friendly.

[1] hoping that array creation from a stream is memory efficient.
[2] versus 17770 files with ratings grouped by movie.

Regards
Manlio Perillo
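A rough sketch of the chunked, MapReduce-style processing discussed above; it assumes a per-file parser that yields one IntMap per file (parseFile is a stand-in for the program's real parsing code, and the chunk size of 100 is arbitrary):

import qualified Data.IntMap as IM
import Data.List (foldl')

-- Split the list of input files into fixed-size chunks.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Parse one chunk of files and merge the per-file maps immediately,
-- so each chunk yields a single partial IntMap.
processChunk :: (FilePath -> IO (IM.IntMap [r])) -> [FilePath] -> IO (IM.IntMap [r])
processChunk parseFile paths = do
    maps <- mapM parseFile paths
    return $! foldl' (IM.unionWith (++)) IM.empty maps

-- Process all chunks (sequentially here; each chunk could instead be
-- handed to a separate thread or process) and merge the partial maps.
processAll :: (FilePath -> IO (IM.IntMap [r])) -> [FilePath] -> IO (IM.IntMap [r])
processAll parseFile paths = do
    partials <- mapM (processChunk parseFile) (chunksOf 100 paths)
    return $! foldl' (IM.unionWith (++)) IM.empty partials

Only the final merge is inherently sequential; the per-chunk work is where the parallelism would go.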

Manlio Perillo wrote:
[...]
Moreover, you may set up a "growing factor". With a g.f. of 1.5, for example, memory will be collected once the heap becomes 1.5x larger than the real memory usage after the last GC. This effectively guarantees that the memory overhead will never exceed this factor.
Thanks. This seems to be effective (but it also reduces performance).
3) With the -F1 option [...] I have to parse the whole data set, to check whether memory usage is good.
Ok, done:

real 49m7.369s  user 45m21.642s  sys 0m18.893s
814 MB used

This is better memory usage compared to:

real 7m17.853s  user 3m38.506s  sys 0m7.612s
1586 MB used

However, Kenneth Hoste reported (http://boegel.kejo.be/) 26 minutes, with 700 MB used. Maybe he was using the latest GHC version. I would also like to check how the performance compares with other functional languages.

Regards
Manlio Perillo

On Mar 2, 2009, at 19:13, Manlio Perillo wrote:
Manlio Perillo wrote:
[...]
Moreover, you may set up a "growing factor". With a g.f. of 1.5, for example, memory will be collected once the heap becomes 1.5x larger than the real memory usage after the last GC. This effectively guarantees that the memory overhead will never exceed this factor.
Thanks. This seems to be effective (but it also reduces performance). 3) With the -F1 option [...] I have to parse the whole data set, to check whether memory usage is good.
Ok, done:
real 49m7.369s user 45m21.642s sys 0m18.893s
814 MB used
This is better memory usage compared to:
real 7m17.853s user 3m38.506s sys 0m7.612s
1586 MB used
However, Kenneth Hoste reported (http://boegel.kejo.be/):
26 minutes, with 700 MB used.
Maybe he was using the latest GHC version. I would also like to check how the performance compares with other functional languages.
The 26m/700MB I mentioned on my blog was on my ancient PowerBook G4 (1.5GHz PowerPC G4, 1.25G).

I redid the same experiment on our iMac (Core2 Duo, 2.0 GHz, 3.0G), i.e.:

- read in all the data
- count the number of keys in the IntMap (which should be 17,770, i.e. the number of movies)
- compute the mean overall movie rating (which should be 3.6033)

That task was done in 3m42s, using just 632M of memory, with the following command:

./netflix ./training_set/ +RTS -A128M -s

The -A option makes sure GC isn't cleaning up stuff too frequently, while -s just reports some statistics.

The way in which I'm reading in the data is somewhat different from yours. I construct the IntMap from the ground up, i.e. starting with an empty IntMap and using foldM, together with a function readMovie with the following type:

readMovie :: IntMap (UArray Int Int, UArray Int Word8) -> FilePath -> IO (IntMap (UArray Int Int, UArray Int Word8))

In readMovie, I'm using the 'insert' function provided by the IntMap module, which just inserts a new key-value pair into the existing IntMap.

Your approach is very different: you create 17,770 IntMaps with a single key/value pair in them, and then use union to combine them all. I profiled your approach on the same iMac (using the same +RTS options), and it needed 4m33s to run, using 948M of memory.

I think my approach is turning out better because I'm:

- building up the IntMap using 'empty' and 'insert', instead of combining 17,770 'singleton' IntMaps (which probably results in better GC behavior)
- using UArray instead of UArr (although I don't know if that actually makes a difference here)

I hope this helps you with figuring out what the bottleneck is on your side. It took me several days to come up with this approach, with help from various Haskellers on IRC, so I'm surely no expert...

I've thrown my current code online at http://boegel.kejo.be/files/Netflix_read-and-parse_24-02-2009.hs , let me know if it's helpful in any way...

Also, I was indeed using GHC 6.10.1, although I'm unsure to what extent that matters.

greetings,

Kenneth

-- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.hoste@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be
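A minimal sketch of the empty + insert construction described above; readMovie keeps the signature quoted in the message, but its body and the parseMovieFile helper are assumptions, since only the type appears here:

import Control.Monad (foldM)
import qualified Data.IntMap as IM
import Data.Array.Unboxed (UArray, listArray)
import Data.Word (Word8)
import System.Directory (getDirectoryContents)
import System.FilePath ((</>))

type Ratings = (UArray Int Int, UArray Int Word8)   -- (user ids, ratings)

-- Insert one movie's ratings into the accumulating map.
readMovie :: IM.IntMap Ratings -> FilePath -> IO (IM.IntMap Ratings)
readMovie acc path = do
    (movieId, users, ratings) <- parseMovieFile path
    let n   = length users
        arr = ( listArray (0, n - 1) users
              , listArray (0, n - 1) ratings )
    return $! IM.insert movieId arr acc

-- Fold readMovie over every file in the training set directory,
-- starting from the empty IntMap.
loadAll :: FilePath -> IO (IM.IntMap Ratings)
loadAll dir = do
    files <- filter (`notElem` [".", ".."]) `fmap` getDirectoryContents dir
    foldM readMovie IM.empty (map (dir </>) files)

-- Placeholder for the real per-file parsing code.
parseMovieFile :: FilePath -> IO (Int, [Int], [Word8])
parseMovieFile _ = return (0, [], [])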

Hello Kenneth, Monday, March 2, 2009, 11:14:27 PM, you wrote:
I think my approach is turning out better because I'm:
- building up the IntMap using 'empty' and 'insert', instead of combining 17,770 'singleton' IntMaps (which probably results in better GC behavior)
I haven't read the details, but maybe it would be better to use a sort-and-group approach? OTOH, this makes an interesting use case for MapReduce technologies.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
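For concreteness, a quick sketch of the sort-and-group idea with simplified types (lists instead of unboxed arrays): collect all (key, value) pairs, sort by key, group equal keys, and build the IntMap in one pass from the ascending list:

import qualified Data.IntMap as IM
import Data.List (sortBy, groupBy)
import Data.Ord (comparing)
import Data.Function (on)

-- Group ratings by key (movie or user id) without repeated unions:
-- sort once, group runs of equal keys, build the map from the sorted list.
groupRatings :: [(Int, r)] -> IM.IntMap [r]
groupRatings pairs =
    IM.fromDistinctAscList
        [ (fst (head grp), map snd grp)
        | grp <- groupBy ((==) `on` fst) (sortBy (comparing fst) pairs) ]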

Hi Kenneth,
I've thrown my current code online at http://boegel.kejo.be/files/Netflix_read-and-parse_24-02-2009.hs , let me know if it's helpful in any way...
Maybe you could set up a darcs repo for this, so that we can submit patches against your code? -- Andy

Kenneth Hoste wrote:
[...] The 26m/700MB I mentioned on my blog was on my ancient PowerBook G4 (1.5GHz PowerPC G4, 1.25G).
I redid the same experiment on our iMac (Core2 Duo, 2.0 GHz, 3.0G), i.e.: - read in all the data - count the number of keys in the IntMap (which should be 17,770, i.e. the number of movies) - compute the mean overall movie rating (which should be 3.6033)
That task was done in 3m42s, using just 632M of memory, using the following command:
./netflix ./training_set/ +RTS -A128M -s
The -A option makes sure GC isn't cleaning up stuff too frequently, while -s just reports some statistics.
Ok, thanks.
The way in which I'm reading in the data is somewhat different from yours. I construct the IntMap from the ground up, i.e. starting with an empty IntMap and using foldM, together with a function readMovie with the following type:
readMovie :: IntMap (UArray Int Int, UArray Int Word8) -> FilePath -> IO (IntMap (UArray Int Int, UArray Int Word8))
In readMovie, I'm using the 'insert' function provided by the IntMap module, which just inserts a new key-value pair into the existing IntMap.
This is the same thing I was doing in a previous version. Some tests showed me that the performance (and memory usage) was practically the same, so I switched to the new implementation.
Your approach is very different: you create 17,770 IntMaps with a single key/value pair in them, and then use union to combine them all.
I profiled your approach on the same iMac (using the same +RTS options), and it needed 4m33s to run, using 948M of memory.
This is very strange, because in my tests I got very similar execution time and memory usage. Did you run my program? Or did you just modify your code?
I think my approach is turning out better because I'm:
- building up the IntMap using 'empty' and 'insert', instead of combining 17,770 'singleton' IntMaps (which probably results in better GC behavior)
- using UArray instead of UArr (although I don't know if that actually makes a difference here)
I hope this helps you with figuring out what the bottleneck is on your side. It took me several days to come up with this approach, with help from various Haskellers on IRC, so I'm surely no expert...
I've thrown my current code online at http://boegel.kejo.be/files/Netflix_read-and-parse_24-02-2009.hs , let me know if it's helpful in any way...
Thanks for sharing!

I have executed your program:

real 6m6.394s  user 1m6.444s  sys 0m7.180s
640 MB usage

So memory usage doesn't change with new versions of GHC (did you try the new parallel garbage collector?). As for running time, it is my hard disk that is slower (the CPU is the same, but I have 2G of RAM).

After reading your code I have a suspicion. I use ByteString.Lazy to read the file contents, while you use a strict ByteString. So I have updated the code to use a strict ByteString (maybe an even better solution is to use bytestring-mmap? [1]).

I have executed the program, using the same RTS flags as yours:

real 6m13.523s  user 0m53.931s  sys 0m7.812s
815 MB usage

This is a huge improvement! Now I have to check whether using insert will further improve memory usage. It's unfortunate that GC is not friendly with the singleton + union solution.
Also, I was indeed using GHC 6.10.1, although I'm unsure to what extent that matter.
greetings,
Kenneth
[1] unfortunately, with bytestring-mmap I'm not sure I have control over resource usage.

Regards
Manlio
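The lazy-to-strict change described above, sketched; it assumes the existing parser is adapted to the strict ByteString type, so each file is read in one piece and no lazy chunks (or references into them) are retained:

import qualified Data.ByteString.Char8 as S   -- strict API
-- previously: import qualified Data.ByteString.Lazy.Char8 as L

-- Read one movie file strictly; the contents are then handed to the
-- existing parsing code (not shown), adapted to the strict ByteString type.
readMovieFile :: FilePath -> IO S.ByteString
readMovieFile = S.readFile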

Manlio Perillo wrote:
[...] I have executed the program, using the same RTS flags as yours:
real 6m13.523s user 0m53.931s sys 0m7.812s
815 MB usage
This is a huge improvement!
Now I have to check whether using insert will further improve memory usage.
And ... surprise!

real 6m12.036s  user 0m51.303s  sys 0m8.433s
813 MB usage

As I suspected (you just have to read the IntMap code), there is *no reason* why empty + insert should be better than singleton + union. So the culprit is elsewhere. It may be the use of uvector, or probably it is because you force strict evaluation more often than I do.
[...]
Regards Manlio
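One plausible source of the strictness difference mentioned above, sketched: force the arrays before they go into the map, so the IntMap never stores a thunk that still references the raw file contents. The Ratings type follows Kenneth's signature; insertStrict is a hypothetical helper, not code from either program:

import qualified Data.IntMap as IM
import Data.Array.Unboxed (UArray)
import Data.Word (Word8)

type Ratings = (UArray Int Int, UArray Int Word8)   -- (user ids, ratings)

-- Force both arrays to weak head normal form before inserting.
insertStrict :: Int -> Ratings -> IM.IntMap Ratings -> IM.IntMap Ratings
insertStrict k v@(ids, rs) m = ids `seq` rs `seq` IM.insert k v m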

Manlio Perillo wrote:
Manlio Perillo wrote:
[...] I have executed the program, using the same RTS flags as yours:
real 6m13.523s user 0m53.931s sys 0m7.812s
815 MB usage
This is a huge improvement!
Using UArray and empty + insert:

real 5m40.732s  user 0m58.620s  sys 0m7.264s
660 MB usage

Using singleton + union, memory usage is the same: 657 MB.

So, now:
1) I don't understand why using uvector leaks memory. I'm using the latest version found on Hackage.
2) I don't understand why your version of the code, using singleton + union, consumes more memory.
[...]
Regards Manlio Perillo

Excerpts from Bulat Ziganshin's message of Mon Mar 02 10:14:35 -0600 2009:
Let's calculate: if at the moment of a GC your program has allocated 100 MB of memory and only 50 MB of it is not garbage, then memory usage will be 150 MB.
? A copying collector allocates a piece of memory (say 10 MB) which is used as the heap, and only one *half* of it ever has data in it. When a copy occurs, all live data is transferred from one half to the other half.

So, if you have a 10 MB heap initially, that means you have two sides of the heap, each 5 MB big. If your program has 2 MB of live data at the time of a GC, then those 2 MB of data are copied over to the other 5 MB side. This copying can be done with pointer arithmetic in a basic manner, so you don't need to allocate any more memory for a copy to occur. So the program never uses more than a 10 MB heap. GHC might be a little different (so correct me if I'm wrong in the case of GHC).

Copying collection has the advantage that for short-lived allocations, GC pauses are pretty quick, because the time spent copying is proportional to the size of the live data. It gets slower if your program has lots of live data at once, not only because collections become more frequent, but because the copies take longer as well.

Austin

Hello Austin, Monday, March 2, 2009, 11:51:52 PM, you wrote:
Let's calculate: if at the moment of a GC your program has allocated 100 MB of memory and only 50 MB of it is not garbage, then memory usage will be 150 MB.
? A copying collector allocates a piece of memory (say 10 MB) which is used as the heap, and only one *half* of it ever has data in it. When
If you are interested, I suggest you run any memory-eater with the +RTS -S switch. The output is somewhat hard to decrypt, but finally you will see what I said.

Let's imagine that you have a 10 MB heap at the moment of a GC, and only 7 MB is live. When doing the GC, GHC copies the live data into new blocks allocated from the OS, so after the GC the program will occupy 10+7 MB. GHC can't return memory back to the OS, so it uses those 10 MB to fulfil further memory requests. As a result, the next GC will occur when 17 MB has been consumed.

You have described a fixed-pool scheme, which is probably a simplified description of the idea that doesn't take dynamic pool growth into account.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

manlio_perillo:
Hi.
After some work I have managed to implement two simple programs that parse the Netflix Prize data set.
For details about the Netflix Prize, there was a post by Kenneth Hoste some time ago.
I have cabalized the program and made it available here: http://haskell.mperillo.ath.cx/netflix-0.0.1.tar.gz
The process-data-1 program parses the training data set, grouping ratings by movie. The process-data-2 program, instead, groups ratings by user.
The structure of the two programs is similar: I create singleton IntMaps and then use foldl + union to merge them together.
The data structure used is:
type Rating = Word32 :*: Word8   -- id, rating
type MovieRatings = IntMap (UArr Rating)
UArr is from the uvector package.
The training data set contains about 100 million ratings, so, in theory, the amount of memory required to store the (id, rating) pairs is about 480 MB.
The process-data-1 program parses the entire dataset using about 1.4 GB of memory (a 3x increase).
This is strange. The memory required should be proportional to the number of ratings. The culprit may be IntMap, or the garbage collector not releasing memory to the operating system (or, worse, not deallocating all the temporary memory it used).
process-data-2 has much bigger problems. When I try to parse the ratings of *only* 500 movies, memory usage is 471 MB; 780 MB after all data has been fully evaluated (array concatenations).
I'm using GHC 6.8.2 on Debian Etch. I would like to run some tests with recent GHC versions, but that may be a problem.
Seems like a useful benchmark of a number of things. Any chance you'll upload it to hackage?

Don Stewart wrote:
manlio_perillo:
Hi.
After some work I have managed to implement two simple programs that parse the Netflix Prize data set.
For details about the Netflix Prize, there was a post by Kenneth Hoste some time ago.
I have cabalized the program and made it available here: http://haskell.mperillo.ath.cx/netflix-0.0.1.tar.gz
[...]
Seems like a useful benchmark of a number of things. Any chance you'll upload it to hackage?
Not the package as it is now; I first need to write a program that generates a reasonable data set. But yes, I will upload it to Hackage, if it can be useful. What name should I use? How should I categorize it?

Regards
Manlio Perillo
participants (6)
- Andy Georges
- Austin Seipp
- Bulat Ziganshin
- Don Stewart
- Kenneth Hoste
- Manlio Perillo