Handling a large database (of ngrams)

Hi Haskellers,

I'm unaware of a "good method" or "default way" of handling large datasets to which I need non-sequential (i.e. random) access in Haskell. My use case is linguistic analysis of a ~30GB corpus. The most basic form of quantitative analysis here is ngram-based HMMs, which aren't difficult to pull off in Haskell. However, I would need a *large* database of ngrams, possibly running into the hundreds of gigabytes. Such a beast will obviously never fit into memory. Moreover, I'll need some sort of random access both during training and during testing/operation.

My initial idea (which I'm currently following up on) is to use Haskell's Berkeley DB wrapper. In this case, I would just stuff all ngrams into one big table (with their counts in an associated column) and let Berkeley DB do the heavy lifting. One simplifying assumption is, of course, to use the flyweight pattern: have one table of unigrams (which would even fit into memory!) associated with indices, and then just reference those indices in the ngram table. But this is only going to decrease the size of the database by some factor, not by an order of magnitude.

Tries come to mind, too. Not only for the unigram table (where they indeed make a lot of sense), but also for the ngram table, where they might or might not help. In this case, the cells of the trie would be word indices pointing into the unigram table. Is there a way to serialize tries over a finite (but possibly large) domain to disk and have semi-speedy random access to them? I'm fully aware that I'll have to wait up to a couple of seconds per lookup; that's OK.

I could hack together a primitive implementation of what I need here, but maybe there's something better out there already? Hackage didn't speak to me[1], and #haskell was busy discussing something else, so I hope -cafe can help me :-)

Thanks for any hints and pointers!
Aleks

[1] There's the ngram package on Hackage, but that only queries Google. I actually want to build my own ngram database, because I'm training on a specific corpus, and will possibly have to adapt to different domains.
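(A minimal in-memory sketch of the flyweight layout described above; the type names are made up, and the Map standing in for the Berkeley DB ngram table is only there to keep the example self-contained.)

import qualified Data.Map.Strict as M
import           Data.List (foldl', tails)
import           Data.Text (Text)

type WordId = Int

-- Flyweight layout: a small interning table for unigrams, and an ngram
-- table keyed by lists of word indices.  In the real setup the ngram
-- table would live in Berkeley DB rather than in memory.
data NGramStore = NGramStore
  { unigrams :: M.Map Text WordId
  , ngrams   :: M.Map [WordId] Int    -- ngram (as word indices) -> count
  }

-- All ngrams of order n in a token sequence.
ngramsOf :: Int -> [a] -> [[a]]
ngramsOf n = filter ((== n) . length) . map (take n) . tails

-- Record every trigram of a sentence, interning unseen words on the fly.
addSentence :: [Text] -> NGramStore -> NGramStore
addSentence toks (NGramStore uni ngr) = NGramStore uni' ngr'
  where
    intern w (m, ids) = case M.lookup w m of
      Just i  -> (m, i : ids)
      Nothing -> let i = M.size m in (M.insert w i m, i : ids)
    (uni', wordIds) = foldr intern (uni, []) toks
    ngr' = foldl' (\t g -> M.insertWith (+) g 1 t) ngr (ngramsOf 3 wordIds)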

On 5/21/11 3:56 PM, Aleksandar Dimitrov wrote:
Hi Haskellers,
I'm unaware of a "good method" or "default way" of handling large datasets to which I need non-sequential (i.e. random) access in Haskell.
My use case is linguistic analysis of a ~30GB corpus. The most basic form of quantitative analysis here is ngram-based HMMs, which aren't difficult to pull off in Haskell.
I've been working on some n-gram stuff lately, which I'm hoping to put up on Hackage sometime this summer (i.e., in a month or so, after polishing off a few rough edges). The project aims to be suitable for industrial-scale problems, though I'm not sure if it'll make it quite so far as 30GB.

Unless you really need to get your hands dirty, the first thing you should do is check out SRILM and see if that can handle it. Assuming SRILM can do it, you'd just need to write some FFI code to call into it from Haskell. This is relatively straightforward, and if it hasn't been done already it'd be a great contribution to the Haskell NLP community. One great thing about this approach is that it's pretty easy to have the SRILM server running on a dedicated machine, so that the memory requirements of the model don't interfere with the memory requirements of your program. (I can give you pointers to Java code doing all this, if you'd like.) One big caveat to bear in mind is that SRILM is not threadsafe, so that'll get in the way of any parallelism or distributed computing on the client side of things. Also, SRILM is hell to install.

If you have too much trouble trying to get SRILM to work, there's also the Berkeley LM, which is easier to install. I'm not familiar with its inner workings, but it should offer pretty much the same sorts of operations.

But, with that in mind, I can offer a few pointers for doing it in Haskell. The first one is interning. Assuming the data is already tokenized (or can be made to be), then it should be straightforward to write a streaming program that converts every unique token into an integer (storing the conversion table somewhere for later use). This will reduce the effective corpus size significantly for any natural language, and dramatically for morphologically poor languages like English. For morphologically rich languages the benefits will be best if you can do morphological segmentation before interning, though whether that's viable depends on your NLP task. It doesn't even need to be complete morphological segmentation, just break on the obvious boundaries in order to get the average word size down to something reasonable.

For regular projects, that integerization would be enough, but for your task you'll probably want to spend some time tweaking the codes. In particular, you'll probably have enough word types to overflow the space of Int32/Word32 or even Int64/Word64. (If not, then skip this section and go boldly forward!) Consequently, you won't be able to fit an ID into a machine register, so you'll want to find a good dynamic representation. If you collect unigram counts along the way (or can estimate their frequencies reliably from a subcorpus), then you can use a Huffman code, Shannon-Fano code, or arithmetic code in order to bring the average codeword length down. The vast majority of word tokens belong to very few word types, so this coding will enable you to fit the vast majority of tokens into a single machine word, saving a single bit or two to indicate that you need to fail over to passing around an array of bits identifying the original token.

If you can get away with not doing those codeword compression tricks, then you can also gather all your n-gram statistics along the way (this is what my program does). Or, if you can partition the corpus into a handful of subcorpora, each of which can get by without the compression tricks, then you can run your training code separately on each subcorpus and unify the ID spaces when folding the results together.
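(A rough sketch of that interning pass; the file names are made up, and plain lazy IO is used just to keep it short. A 30GB corpus would really want an iteratee/enumerator-style reader.)

import qualified Data.Map.Strict   as M
import qualified Data.Text.Lazy    as TL
import qualified Data.Text.Lazy.IO as TLIO
import           Data.List (mapAccumL)

-- Replace each token by an integer code, remembering the conversion table.
intern :: M.Map TL.Text Int -> TL.Text -> (M.Map TL.Text Int, Int)
intern table tok = case M.lookup tok table of
  Just i  -> (table, i)
  Nothing -> let i = M.size table in (M.insert tok i table, i)

main :: IO ()
main = do
  corpus <- TLIO.readFile "corpus.tok"     -- whitespace-tokenised text
  let (table, ids) = mapAccumL intern M.empty (TL.words corpus)
  writeFile "corpus.ids" (unwords (map show ids))
  writeFile "vocab.tsv"
    (unlines [ TL.unpack w ++ "\t" ++ show i | (w, i) <- M.toList table ])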
Other than interning, another major way to save on space is to use tries, as you mentioned. In particular, every tree for n-grams fits into a tree for (n+1)-grams, and since you're storing the (n+1)-gram tree already you might as well have the n-gram tree reuse the overhead. In my experience, unifying all the different n-gram trees has at most a trivial speed cost compared to giving them all dedicated representations.

The next question is what sort of trie representation to use. Something based on big-endian patricia trees (e.g., containers:Data.IntMap or bytestring-trie:Data.Trie) is a good place to start. Though, you'll want to make them wider than binary-branching in order to reduce the overhead of the pointers et al. If you're still running into memory issues for storing the whole trained model, then you should look into cache-oblivious tree structures that you can mmap in.

If all that still isn't enough, then that's where you start breaking into doing research :)

-- 
Live well,
~wren
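(A small sketch of that unified trie over interned word ids, using plain Data.IntMap; a production version would want the wider-branching, mmap-friendly structures mentioned above.)

import qualified Data.IntMap as IM

-- One trie stores counts for all ngram orders at once: the node reached by
-- following the word ids of an ngram holds that ngram's count, and shorter
-- ngrams are simply prefixes of the same paths.
data NGramTrie = NGramTrie
  { count    :: !Int
  , children :: IM.IntMap NGramTrie
  }

emptyTrie :: NGramTrie
emptyTrie = NGramTrie 0 IM.empty

-- Bump the count of one ngram (given as word ids), creating nodes as needed.
bump :: [Int] -> NGramTrie -> NGramTrie
bump []     (NGramTrie c ch) = NGramTrie (c + 1) ch
bump (w:ws) (NGramTrie c ch) =
  NGramTrie c (IM.insert w (bump ws (IM.findWithDefault emptyTrie w ch)) ch)

-- Count of an ngram; 0 if it was never seen.
lookupCount :: [Int] -> NGramTrie -> Int
lookupCount []     t = count t
lookupCount (w:ws) t = maybe 0 (lookupCount ws) (IM.lookup w (children t))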

Hi Wren,

First of all, thanks for your elaborate answer! Your input is very much appreciated!

On Sat, May 21, 2011 at 10:42:57PM -0400, wren ng thornton wrote:
I've been working on some n-gram stuff lately, which I'm hoping to put up on Hackage sometime this summer (i.e., in a month or so, after polishing off a few rough edges).
Unfortunately, my current time constraints don't allow me to wait for your code :-( However, research on that system might continue well into the summer (if I can manage to get paid for it…) so I'll keep an eye out for your announcement!
Unless you really need to get your hands dirty, the first thing you should do is check out SRILM and see if that can handle it. Assuming SRILM can do it, you'd just need to write some FFI code to call into it from Haskell. This is relatively straightforward, and if it hasn't been done already it'd be a great contribution to the Haskell NLP community. One great thing about this approach is that it's pretty easy to have the SRILM server running on a dedicated machine, so that the memory requirements of the model don't interfere with the memory requirements of your program. (I can give you pointers to Java code doing all this, if you'd like.) One big caveat to bear in mind is that SRILM is not threadsafe, so that'll get in the way of any parallelism or distributed computing on the client side of things. Also, SRILM is hell to install.
I don't need distributed computing, but I haven't used SRILM so far because I'd have to write a wrapper around it, and because I need to do some custom language-model work: I need to write functions that operate on ngram probabilities directly, and I want to try out different machine learning methods to compare their merits for my research. I'll see if SRILM is flexible enough to do this, though.
If you have too much trouble trying to get SRILM to work, there's also the Berkeley LM which is easier to install. I'm not familiar with its inner workings, but it should offer pretty much the same sorts of operations.
Do you know how BerkeleyLM compares to, say, MongoDB and PostgreSQL for large data sets? Maybe this is also the wrong list to ask this kind of question on.
But, with that in mind I can offer a few pointers for doing it in Haskell. The first one is interning. Assuming the data is already tokenized (or can be made to be), then it should be straightforward to write a streaming program that converts every unique token into an integer (storing the conversion table somewhere for later use).
This is what I meant by using the flyweight pattern. There's of course also the possibility of computing a hash of every string, but I don't want to deal with hash collisions. They are largely avoidable (using, say, SHA1), but since that would force a multi-byte index, I don't know if it would help much, seeing as the average word length isn't dramatically big.
It doesn't even need to be complete morphological segmentation, just break on the obvious boundaries in order to get the average word size down to something reasonable.
I will do my best to avoid having to do this. My current research target is German, which is richer in morphology than English, but not as much as Turkic, Slavic or Uralic languages. In the case of a strongly inflectional language, or even an agglutinating or polysynthetic one, the need for smarter morphological analysis would arise. Unfortunately, this would have to come in the form of language-specific morphological analysers. Since this is something I would want to do in general anyway, I might write a morphological analyser for German that breaks words down to case and tense markings as well as lemmas, but this isn't the focus of my project, so I'll first try to do without.
For regular projects, that integerization would be enough, but for your task you'll probably want to spend some time tweaking the codes. In particular, you'll probably have enough word types to overflow the space of Int32/Word32 or even Int64/Word64.
I will most likely run into problems with integer space because of the long tail of word frequencies; not only that, but these ultra-rare words aren't really going to be of much use to me. I'm debating using a trie for common words while keeping a backlog of rare words in a MongoDB instance. Rare words could graduate from there if they occur frequently enough, and get accepted into the trie for easier access. Everything left over in the long tail would just be mapped to RARE<TAG> in the n-grams, where TAG refers to the part-of-speech tag (since my corpus is POS-tagged). What counts as "rare" and "common" will have to be subject to experimentation.

A Bloom filter might guard the trie, too. Since Bloom filters can tell you when a certain element is definitely *not* in a collection, I could cut down search operations on the trie itself for all the rare words. AFAICR there's a chapter in RWH on building Bloom filters.

There are other ways to cut down the variance of ngrams in large corpora: parsing dates and other cardinalities and subsuming them under one category, using named-entity recognition and subsuming all NEs under one category, etc.
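(A toy sketch of that cutoff; the threshold value and the exact RARE<TAG> spelling are placeholders to be tuned experimentally, and the Bloom-filter guard is left out.)

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map.Strict as M
import           Data.Text (Text)
import qualified Data.Text as T

-- A token from the POS-tagged corpus: (word form, part-of-speech tag).
type Tagged = (Text, Text)

rareThreshold :: Int
rareThreshold = 5   -- what counts as "rare" is subject to experimentation

-- Collapse rare word forms to RARE<TAG>, keep common ones as they are.
normalise :: M.Map Text Int -> Tagged -> Text
normalise unigramCounts (w, tag)
  | M.findWithDefault 0 w unigramCounts >= rareThreshold = w
  | otherwise = T.concat ["RARE<", tag, ">"]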
(If not, then skip this section and go boldly forward!) Consequently, you won't be able to fit an ID into a machine register, so you'll want to find a good dynamic representation. If you collect unigram counts along the way (or can estimate their frequencies reliably from a subcorpus), then you can use a Huffman code, Shannon-Fano code, or arithmetic code in order to bring the average codeword length down. The vast majority of word tokens belong to very few word types, so this coding will enable you to fit the vast majority of tokens into a single machine word, saving a single bit or two to indicate that you need to fail over to passing around an array of bits identifying the original token.
That's kinda similar to what I had in mind (in general :-)
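(For reference, a back-of-the-envelope sketch of the coding scheme quoted above: a Huffman code over word ids built from unigram counts, so that frequent words get short codewords. Purely illustrative; a real implementation would pack the bits rather than keep them as [Bool].)

import           Data.List (sortBy, insertBy)
import           Data.Ord  (comparing)
import qualified Data.Map  as M

-- A Huffman tree over word ids, weighted by unigram counts.
data HTree = Leaf Int Int          -- weight, word id
           | Node Int HTree HTree  -- weight, left, right

weight :: HTree -> Int
weight (Leaf w _)   = w
weight (Node w _ _) = w

-- Build the tree from (word id, unigram count) pairs.
buildHuffman :: [(Int, Int)] -> HTree
buildHuffman = merge . sortBy (comparing weight) . map (\(i, c) -> Leaf c i)
  where
    merge [t]      = t
    merge (a:b:ts) = merge (insertBy (comparing weight)
                                     (Node (weight a + weight b) a b) ts)
    merge []       = error "buildHuffman: empty vocabulary"

-- Read the codewords off the tree: frequent words get short bit strings.
codebook :: HTree -> M.Map Int [Bool]
codebook = go []
  where
    go path (Leaf _ i)   = M.singleton i (reverse path)
    go path (Node _ l r) = go (False : path) l `M.union` go (True : path) r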
The next question is what sort of trie representation to use. Something based on big-endian patricia trees (e.g., containers:Data.IntMap or bytestring-trie:Data.Trie) is a good place to start. Though, you'll want to make them wider than binary-branching in order to reduce the overhead of the pointers et al. If you're still running into memory issues for storing the whole trained model, then you should look into cache-oblivious tree structures that you can mmap in.
Thanks, this will help me greatly. I didn't know which data structures to use. I will also have to think about (de)serialization because I want to make these translation tables persistent.
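(A minimal sketch of persisting such a translation table with the binary package, relying on the Binary instances it provides for Map, String and Int; the String keys and the file name are just for illustration, a real table would more likely use UTF-8 ByteString keys.)

import           Data.Binary (encodeFile, decodeFile)
import qualified Data.Map    as M

type Vocab = M.Map String Int

-- Write the interning table to disk.
saveVocab :: FilePath -> Vocab -> IO ()
saveVocab = encodeFile

-- Read it back for later runs.
loadVocab :: FilePath -> IO Vocab
loadVocab = decodeFile

-- e.g.  saveVocab "vocab.bin" table   and later   table' <- loadVocab "vocab.bin"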
If all that still isn't enough, then that's where you start breaking into doing research :)
Indeed I will. I'm going to document this implementation in a fairly technical paper, so if anyone's interested in the result, I can post it on my GitHub or so. I'll probably try to split off the general analysis part and make a Hackage package out of it and/or post to haskell-nlp.

Regards,
Aleks

On 5/22/11 8:40 AM, Aleksandar Dimitrov wrote:
If you have too much trouble trying to get SRILM to work, there's also the Berkeley LM which is easier to install. I'm not familiar with its inner workings, but it should offer pretty much the same sorts of operations.
Do you know how BerkeleyLM compares to, say, MongoDB and PostgreSQL for large data sets? Maybe this is also the wrong list to ask this kind of question on.
Well, BerkeleyLM is specifically for n-gram language modeling; it's not a general database. According to the paper I mentioned off-list, the entire Google Web1T corpus (approx. 1 trillion word tokens, 4 billion n-gram types) can be fit into 10GB of memory, which is much smaller than SRILM can do.

Databases aren't really my area, so I couldn't give a good comparison. Though for this scale of data you're going to want to use something specialized for storing n-grams, rather than a general database. There's a lot of redundant structure in n-gram counts and you'll want to take advantage of that.
For regular projects, that integerization would be enough, but for your task you'll probably want to spend some time tweaking the codes. In particular, you'll probably have enough word types to overflow the space of Int32/Word32 or even Int64/Word64.
Again according to Pauls & Klein (2011), Google Web1T has 13.5M word types, which easily fits into 24 bits (2^24 is about 16.8M). That's for English, so morphologically rich languages will be different. I wouldn't expect too many problems for German, unless you have a lot of technical text with a prodigious number of unique compound nouns. Even then I'd be surprised if you went over 2^64 (that'd be reserved for languages like Japanese, Hungarian, Inuit, ... if even they'd ever get that bad).

-- 
Live well,
~wren