Re: [Haskell-cafe] uvector package appendU: memory leak?

wren ng thornton wrote:
Manlio Perillo wrote:
Since ratings for each customer are parsed "at the same time", using a plain list would consume a lot of memory, since stream fusion can only be executed at the end of the parsing.
On the other hand, when I want to group ratings by movies, stream fusion seems to work fine.
[...]
For the problem as you've discussed it, I'd suggest a different approach: You can't fit all the data into memory at once, so you shouldn't try to. You should write one program that takes in the per-movie grouping of data and produces a per-user file as output.
Well, creating 480189 files in a directory is not a very nice thing to do to a normal file system. I could arrange the files into subdirectories, but then this starts to become too complex. The solution I'm using now just works. It takes about 950 MB of memory and 35 minutes, but that is not a big problem, since:

1) once loaded, I can serialize the data in binary format;
2) I think the program can be parallelized, parsing subsets of the files in N threads and then merging the maps.

This method should also optimize array copying. The problem is that unionWith seems to be lazy, and as far as I can tell there is no strict variant.
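For the merge step, here is a rough sketch of what I have in mind (assuming the per-thread results are IntMaps keyed by user id; the sketch uses Data.Vector.Unboxed in place of uvector's UArr/appendU, and Data.IntMap.Strict.unionWith to keep the merge strict; the type and function names are only illustrative):

    import Data.List (foldl')
    import Data.Word (Word8)
    import qualified Data.IntMap.Strict as IM
    import qualified Data.Vector.Unboxed as U

    -- One map per parsed subset of the input files:
    -- user id -> ratings collected so far.
    type Ratings = IM.IntMap (U.Vector Word8)

    -- Merge the per-thread maps.  unionWith from Data.IntMap.Strict forces
    -- the combined value for every colliding key, so the vector appends are
    -- performed eagerly instead of accumulating as thunks.
    mergeAll :: [Ratings] -> Ratings
    mergeAll = foldl' (IM.unionWith (U.++)) IM.empty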
Then have your second program read in the reorganized data and do fusion et al.
This reduces the problem to just writing the PerMovie -> PerUser program. Since you still can't fit all the data into memory, that means you can't hope to write the per-user file in one go.
The data *do* fit into memory, fortunately.
[...]
Best of luck.
Thanks,
Manlio

Manlio Perillo wrote:
wren ng thornton wrote:
Manlio Perillo wrote:
Since ratings for each customer are parsed "at the same time", using a plain list would consume a lot of memory, since stream fusion can only be executed at the end of the parsing.
On the other hand, when I want to group ratings by movies, stream fusion seems to work fine.
[...]
For the problem as you've discussed it, I'd suggest a different approach: You can't fit all the data into memory at once, so you shouldn't try to. You should write one program that takes in the per-movie grouping of data and produces a per-user file as output.
Well, creating 480189 files in a directory is not a very nice thing to do to a normal file system.
I said *a* per-user file; that is, a single file of (Movie,User,Rating) records sorted by User, as opposed to your current file, which is sorted by Movie. As for the interim binning process, one file per user is only the extreme case. Honestly, you should be able to get away with just reading through the file, splitting it into a handful of bins (8 or 16 at most), and assigning each user u to the bin u `mod` qtyBins. The bins will be of uneven sizes, but they should be even enough that each is small enough to fit into memory. For data of this size you can probably even leave the bins as a bag of (User,Movie,Rating) records and forgo the follow-up step of sorting each bin by User.

--
Live well,
~wren
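A minimal sketch of the binning pass described above, assuming each line of the input holds whitespace-separated movie, user, and rating fields (qtyBins, binByUser, and the bin-N.txt file names are only illustrative):

    import qualified Data.ByteString.Lazy.Char8 as B
    import System.IO

    qtyBins :: Int
    qtyBins = 16

    -- Split a file of (Movie,User,Rating) records into qtyBins smaller
    -- files, sending each record to the bin chosen by user `mod` qtyBins.
    -- The input is read lazily, so only a small chunk is in memory at once.
    binByUser :: FilePath -> IO ()
    binByUser input = do
      hs <- mapM (\i -> openFile ("bin-" ++ show i ++ ".txt") WriteMode)
                 [0 .. qtyBins - 1]
      contents <- B.readFile input
      mapM_ (writeRow hs) (B.lines contents)
      mapM_ hClose hs
      where
        writeRow hs row =
          case B.words row of
            (_movie : user : _) ->
              case B.readInt user of
                Just (u, _) ->
                  let h = hs !! (u `mod` qtyBins)
                  in B.hPutStr h row >> B.hPutStr h (B.pack "\n")
                Nothing -> return ()
            _ -> return ()

Each bin should then be small enough to load and regroup by user on its own.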
participants (2)
- Manlio Perillo
- wren ng thornton