
On Tue, May 31, 2011 at 11:43:27AM -0700, Brandon Moore wrote:
> I can't reproduce heap usage growing with the size of the input file.
> I made a word list from Project Gutenberg's copy of "War and Peace" by
>   tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt
> Using 1, 25, or 1000 repetitions of this ~3MB wordlist shows about 100MB of address space used according to top, and no more than 5MB or so of Haskell heap used according to the memory profile, which stays flat.
This will lead to very small variance in your data. The effect I'm experiencing is so small in this case that it's barely observable (but it is; see below).
> Is your memory usage growing with the size of the input file, or the size of the histogram?
While the histogram naturally grows with the size of the input file, memory usage seems to be proportional mainly to the histogram. It is also clear that, due to the Long Tail, the histogram keeps growing in a real setting, as opposed to one that just replicates the same data over and over. In your test case, the histogram is *not* growing with the size of the input file.
The memory usage is proportional to the histogram, which is proportional to the file size. That is not the problem. The problem is that, compared to the size of the histogram, the memory consumption is *disproportionately* high. Here's some more data, using your Tolstoi example:
du file.txt
344M file.txt
> I was worried data sharing might mean your keys retain entire 64K chunks of the input. However, it seems enumLines depends on the StringLike ByteString instance, which just converts to and from String.
Ouch, that sounds like something worth fixing.
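If the lines did end up sharing storage with the input chunks, one workaround on my side would be to copy each key before it goes into the map, so the map never retains the chunk a word was sliced out of. Roughly (just a sketch; bumpWord is an illustrative name, not what my code actually calls it):

    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashMap.Strict   as M

    -- Copy the key so the map doesn't retain the (much larger) input
    -- chunk the word was sliced out of.
    bumpWord :: M.HashMap B.ByteString Int -> B.ByteString -> M.HashMap B.ByteString Int
    bumpWord histogram word = M.insertWith (+) (B.copy word) 1 histogram

But if enumLines really round-trips through String, the keys come out as freshly packed ByteStrings anyway, so sharing presumably isn't what's eating the memory here.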
> The other thing that occurs to me is that the total size of your keys would also be approximately the size of the input file if you were using plain text without each word split onto a separate line.
Well, I am not. The corpus is a word-per-line corpus: I'm reading one word per line and adding it to my hash map. This should never result in a data structure even close to the size of the original corpus. It could, in a very unrealistic worst-case scenario, but even a 30GB corpus of Poe and Heidegger isn't going to make that happen, and mine is not such a scenario at all.

As I said, if you reduce the corpus to a set of words (i.e. a set of unigrams), you get a 40MB file from a 1.4GB corpus. Why is it that, in order to create that 40MB file, my trivial little program needs somewhere north of 6-8GB of RAM? And in the War and Peace example, why is it that creating the unigram table, which ends up being a mere 201KB, chomps through 5MB on average and nearly 10MB at peak? That's at least 25 times more than we actually *should* need (yes, I know the RTS accounts for some of that, too, but I think it's fine to ignore that for now).

Regards,
Aleks
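P.S. In case it helps to see what I mean by "trivial little program": the real code goes through enumLines, but stripped of that plumbing it is essentially the loop below (a rough sketch using lazy IO instead, so names and details differ from my actual code):

    import qualified Data.ByteString.Char8      as B
    import qualified Data.ByteString.Lazy.Char8 as L
    import qualified Data.HashMap.Strict        as M
    import Data.List (foldl')

    -- One word per line on stdin; bump each word in a strict hash map
    -- and print only the number of distinct words at the end.
    main :: IO ()
    main = do
      corpus <- L.getContents
      let toKey       = B.concat . L.toChunks   -- each (lazy) line as a strict key
          bump m word = M.insertWith (+) (toKey word) (1 :: Int) m
          histogram   = foldl' bump M.empty (L.lines corpus)
      print (M.size histogram)

Nothing but the histogram should survive that fold, so I'd expect the live heap to stay within shouting distance of the table's final size.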