On Wed, Jun 1, 2011 at 12:55 AM, Aleksandar Dimitrov <aleks.dimitrov@googlemail.com> wrote:
> On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:
>
> > None of these leak space for me (all compiled with ghc-7.0.3 -O2).
> > Performance was pretty comparable for every version, although Aleksander's
> > original did seem to have a very small edge.
>
> How big were your input corpora?

Today I was using multiple copies of War & Peace, as Brandon specified.  Total size is about 90M.

> So it seems that I can't get memory usage below around 3x the input file size.
> Luckily, the dependency seems to be linear. Here's some profiling:
>
> <<ghc: 30478883712 bytes, 57638 GCs, 41925397/143688744 avg/max bytes residency (189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT (24.94 elapsed), 26.71 GC (27.10 elapsed) :ghc>>
> ../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s system 99% cpu 52.064 total
>
> GHC itself reports 38MB avg residency (I can live with that) and 140MB max (too much).
>
> Redirecting the program's output to a file yields a mere 2.2M for the data
> gathered by the above script. Since those 2.2M of data are all I care about, why
> do I need so much more RAM to compute them?
>
> Are my demands unreasonable?

I think the issue is data sharing, as Brandon mentioned.  A bytestring consists of an offset, a length, and a pointer into a shared buffer.  You're using a chunk size of 64k, which means the generated bytestrings all point into those 64k chunks of data.  Suppose there's one new word in a chunk, near its beginning.  Even though the word may be only a few characters long, it'll reference the entire chunk and keep it from being GC'd.
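To make the sharing concrete, here's a minimal sketch (the function is mine, not from your program): slicing operations like takeWhile return a view into the same underlying buffer rather than copying.

    import qualified Data.ByteString.Char8 as B

    -- A ByteString slice is (buffer, offset, length).  'firstWord'
    -- shares its buffer with 'chunk', so even a 3-byte result keeps
    -- the whole 64k chunk reachable and prevents it from being GC'd.
    firstWord :: B.ByteString -> B.ByteString
    firstWord chunk = B.takeWhile (/= ' ') chunk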

There are a few solutions to this.  The first is to make a copy of the bytestring so that only the required data is retained; in my experiments this wasn't helpful, but it would depend on your corpus.  The second is to start with smaller chunks: a chunk size of 1024 worked fairly well for me, and if your corpus is similar to natural language I think it'll probably work well for you as well.  Sketches of both are below.
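Here's what the copying fix looks like (a sketch; the names are illustrative, not from your program):

    import qualified Data.ByteString.Char8 as B

    -- B.copy allocates a fresh buffer of exactly the word's length,
    -- so nothing references the original 64k chunk and it can be GC'd.
    firstWordCopied :: B.ByteString -> B.ByteString
    firstWordCopied = B.copy . B.takeWhile (/= ' ')

In a frequency table you'd normally copy a key only on its first insertion, so the copying cost is paid once per distinct word rather than once per token.  The smaller-chunk approach, sketched here independently of any particular enumerator library, amounts to a read loop with a small buffer:

    import qualified Data.ByteString.Char8 as B
    import System.IO

    -- With 1024-byte reads, a retained word pins at most 1k of
    -- buffer instead of 64k.
    processFile :: (B.ByteString -> IO ()) -> FilePath -> IO ()
    processFile f path = withFile path ReadMode loop
      where
        loop h = do
          chunk <- B.hGet h 1024
          if B.null chunk
            then return ()
            else f chunk >> loop h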

Note that Johan's Ngram code also keeps only the minimum required data, giving it a good memory profile.  I didn't notice this last night because I was testing with different data, and the peculiar distribution of that data unfortunately masked the problem.

John L