
Hi Haskellers,

I'm not aware of a "good method" or "default way" of handling large datasets that need non-sequential (i.e. random) access in Haskell. My use case is linguistic analysis of a ~30GB corpus. The most basic form of quantitative analysis here is ngram-based HMMs, which aren't difficult to pull off in Haskell. However, I would need a *large* database of ngrams, possibly running into the hundreds of gigabytes. Such a beast will obviously never fit into memory. Moreover, I'll need some sort of random access to it, both during training and during testing/operation.

My initial idea (the one I'm currently following up on) is to use Haskell's Berkeley DB wrapper. In that case, I would just stuff all ngrams into one big table (with their counts in an associated column) and let Berkeley DB do the heavy lifting. One obvious simplification is the flyweight pattern: keep a single table of unigrams (which would even fit into memory!) mapped to indices, and then reference only the indices in the ngram table. But that only shrinks the database by a constant factor, not by an order of magnitude.

Tries come to mind, too. Not only for the unigram table (where they indeed make a lot of sense), but also for the ngram table, where they might or might not help. In that case, the cells of the trie would be word indices pointing into the unigram table. Is there a way to serialize tries over a finite (but possibly large) domain to disk and still get semi-speedy random access to them? I'm fully aware that I'll have to wait up to a couple of seconds per lookup; that's OK.

I could hack together a primitive implementation of what I need here (I've appended a few rough sketches of what I have in mind as a P.S. below), but maybe there's something better out there already? Hackage didn't speak to me [1], and #haskell was busy discussing something else, so I hope -cafe can help me :-)

Thanks for any hints and pointers!

Aleks

[1] There's the ngram package on Hackage, but that only queries Google. I actually want to build my own ngram database, because I'm training on a specific corpus and will possibly have to adapt it to different domains.
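
P.S. A few rough sketches of what I have in mind, to make the question more concrete. First, the flyweight step: intern each word once, give it a dense Int index, and represent every ngram as a short list of indices. The encoded index list could then serve directly as the key in the big Berkeley DB table, with the encoded count as the value. This only uses containers and binary; all the names are mine, not from any existing wrapper:

import Data.List (mapAccumL)
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Lazy as BL
import Data.Binary (encode)

-- Unigram table: every distinct word gets a dense Int index.
data Lexicon = Lexicon
  { wordToIx :: M.Map String Int
  , nextIx   :: !Int
  }

emptyLexicon :: Lexicon
emptyLexicon = Lexicon M.empty 0

-- Look up a word's index, allocating a fresh one on first sight.
intern :: Lexicon -> String -> (Lexicon, Int)
intern lex@(Lexicon m n) w =
  case M.lookup w m of
    Just i  -> (lex, i)
    Nothing -> (Lexicon (M.insert w n m) (n + 1), n)

-- An ngram becomes a short list of indices instead of raw strings.
internNgram :: Lexicon -> [String] -> (Lexicon, [Int])
internNgram = mapAccumL intern

-- Serialized forms, usable as key/value pairs in any key-value
-- store, e.g. one big Berkeley DB table of ngram -> count.
ngramKey :: [Int] -> BL.ByteString
ngramKey = encode

countVal :: Int -> BL.ByteString
countVal = encode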
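
Second, the kind of trie over word indices I'm imagining for the ngram table, with Data.IntMap doing the branching and the counts stored at the nodes (again just a sketch, names made up):

import qualified Data.IntMap.Strict as IM

-- Each node carries the count of the ngram spelled out by the
-- path leading to it; children branch on the next word index.
data Trie = Trie
  { nodeCount :: !Int
  , nodeKids  :: IM.IntMap Trie
  }

emptyTrie :: Trie
emptyTrie = Trie 0 IM.empty

-- Bump the count of one ngram (a path of word indices).
addNgram :: [Int] -> Trie -> Trie
addNgram []     (Trie c ch) = Trie (c + 1) ch
addNgram (i:is) (Trie c ch) =
  Trie c (IM.insertWith (\_ old -> addNgram is old) i
                        (addNgram is emptyTrie) ch)

-- Count of an ngram; Nothing if no inserted ngram starts this way
-- (prefixes of longer ngrams report Just 0).
ngramCount :: [Int] -> Trie -> Maybe Int
ngramCount []     (Trie c _)  = Just c
ngramCount (i:is) (Trie _ ch) = IM.lookup i ch >>= ngramCount is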
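
And finally, the serialization question. One layout I could imagine (purely hypothetical, and exactly the part I'd rather not hack together myself): write the nodes post-order (children before parents), length-prefixed, with each parent recording (word index, file offset) pairs for its children. A lookup then follows the offsets with hSeek, touching one node per word:

import Data.Binary (encode, decode)
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as BL
import qualified Data.IntMap.Strict as IM
import System.IO

-- The in-memory trie from the previous sketch.
data Trie = Trie { nodeCount :: !Int, nodeKids :: IM.IntMap Trie }

-- On disk, a node is (count, [(word index, child offset)]),
-- length-prefixed so it can be read back without knowing its size.
type DiskNode = (Int, [(Int, Integer)])

writeNode :: Handle -> DiskNode -> IO Integer
writeNode h node = do
  off <- hTell h
  let bs = encode node
  BL.hPut h (encode (BL.length bs))   -- 8-byte Int64 length prefix
  BL.hPut h bs
  return off

readNode :: Handle -> Integer -> IO DiskNode
readNode h off = do
  hSeek h AbsoluteSeek off
  lenBs <- BL.hGet h 8
  bs <- BL.hGet h (fromIntegral (decode lenBs :: Int64))
  return (decode bs)

-- Post-order dump: children first, so each parent knows their
-- offsets. Returns the root's offset; store that in a header.
dumpTrie :: Handle -> Trie -> IO Integer
dumpTrie h (Trie c ch) = do
  kids <- mapM (\(i, t) -> do o <- dumpTrie h t; return (i, o))
               (IM.toList ch)
  writeNode h (c, kids)

-- One seek and one node read per word in the ngram.
diskCount :: Handle -> Integer -> [Int] -> IO (Maybe Int)
diskCount h off path = do
  (c, kids) <- readNode h off
  case path of
    []     -> return (Just c)
    (i:is) -> maybe (return Nothing) (\o -> diskCount h o is)
                    (lookup i kids)

You'd dump with a handle opened in WriteMode, remember the root offset, and query with a handle in ReadMode. No idea how this would compare to just letting Berkeley DB index the flat keys, which is part of why I'm asking.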