
On Tue, May 31, 2011 at 11:43:27AM -0700, Brandon Moore wrote:
> I can't reproduce heap usage growing with the size of the input file.
> I made a word list from Project Gutenberg's copy of "War and Peace" by
>   tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt
> Using 1, 25, or 1000 repetitions of this ~3MB wordlist shows about 100MB of address space used according to top, and no more than 5MB or so of Haskell heap used according to the memory profile, which stays flat.
This will lead to very small variance in your data. The effect I'm experiencing is so small in this case that it's barely observable (but it is; see below).
> Is your memory usage growing with the size of the input file, or the size of the histogram?
While the histogram naturally grows with the size of the input file, memory usage seems to be proportional mainly to the histogram. It is also clear that, due to the Long Tail, the histogram keeps growing in a real setting, as opposed to one that just replicates the same data over and over. In your test case, the histogram is *not* growing with the size of the input file.
The memory usage is proportional to the histogram, which is proportional to the file size. That is not the problem. The problem is that, compared to the size of the histogram, the memory consumption is *disproportionately* high. Here's some more data, using your Tolstoi example:
du file.txt
344M file.txt
> I was worried data sharing might mean your keys retain entire 64K chunks of the input. However, it seems enumLines depends on the StringLike ByteString instance, which just converts to and from String.
Ouch, that sounds like something worth fixing.
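If the lines did end up sharing storage with the input chunks, one workaround on my side would be to copy each key before it goes into the map, so the map never retains the chunk a word was sliced out of. Roughly (just a sketch; bumpWord is an illustrative name, not what my code actually calls it):

    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashMap.Strict   as M

    -- Copy the key so the map doesn't retain the (much larger) input
    -- chunk the word was sliced out of.
    bumpWord :: M.HashMap B.ByteString Int -> B.ByteString -> M.HashMap B.ByteString Int
    bumpWord histogram word = M.insertWith (+) (B.copy word) 1 histogram

But if enumLines really round-trips through String, the keys come out as freshly packed ByteStrings anyway, so sharing presumably isn't what's eating the memory here.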
> The other thing that occurs to me is that the total size of your keys would also be approximately the size of the input file if you were using plain text without each word split onto a separate line.
Well, I am not. The corpus is a word-per-line corpus: I'm reading one word per line and adding it to my hash map. This should never result in a data structure even close to the size of the original corpus. It could, in a very unrealistic worst-case scenario, but even a 30GB corpus of Poe and Heidegger isn't going to make that happen, and mine is not such a scenario at all.

As I said, if you reduce the corpus to a set of words (i.e. a set of unigrams), you get a 40MB file from a 1.4GB corpus. Why is it that, in order to create that 40MB file, my trivial little program needs somewhere north of 6-8GB of RAM? And in the War and Peace example, why is it that creating the unigram table, which ends up being a mere 201KB, chomps through 5MB on average and nearly 10MB at peak? That's at least 25 times more than we actually *should* need (yes, I know the RTS accounts for some of that, too, but I think it's fine to ignore that for now).

Regards,
Aleks
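P.S. In case it helps to see what I mean by "trivial little program": the real code goes through enumLines, but stripped of that plumbing it is essentially the loop below (a rough sketch using lazy IO instead, so names and details differ from my actual code):

    import qualified Data.ByteString.Char8      as B
    import qualified Data.ByteString.Lazy.Char8 as L
    import qualified Data.HashMap.Strict        as M
    import Data.List (foldl')

    -- One word per line on stdin; bump each word in a strict hash map
    -- and print only the number of distinct words at the end.
    main :: IO ()
    main = do
      corpus <- L.getContents
      let toKey       = B.concat . L.toChunks   -- each (lazy) line as a strict key
          bump m word = M.insertWith (+) (toKey word) (1 :: Int) m
          histogram   = foldl' bump M.empty (L.lines corpus)
      print (M.size histogram)

Nothing but the histogram should survive that fold, so I'd expect the live heap to stay within shouting distance of the table's final size.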