
On 5/22/11 8:40 AM, Aleksandar Dimitrov wrote:
>> If you have too much trouble trying to get SRILM to work, there's also the Berkeley LM which is easier to install. I'm not familiar with its inner workings, but it should offer pretty much the same sorts of operations.
> Do you know how BerkeleyLM compares to, say, MongoDB and PostgreSQL for large data sets? Maybe this is also the wrong list for this kind of question.
Well, BerkeleyLM is specifically for n-gram language modeling; it's not a general-purpose database. According to the paper I mentioned off-list, it can fit the entire Google Web1T corpus (approx. 1 trillion word tokens, 4 billion n-gram types) into 10GB of memory, which is far less than SRILM needs. Databases aren't really my area, so I can't give a good comparison. But at this scale you're going to want something specialized for storing n-grams rather than a general database: there's a lot of redundant structure in n-gram counts, and you'll want to take advantage of it.
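I can't speak to BerkeleyLM's internals, but to make the "redundant structure" point concrete, here's a bare-bones sketch of the kind of prefix sharing I mean. It assumes the words have already been integerized into Word32 IDs and uses a plain Data.Map trie, so it only illustrates the idea rather than being space-efficient itself:

    import qualified Data.Map.Strict as M
    import Data.Word (Word32)

    -- One node per shared context: "the quick brown" and "the quick fox"
    -- walk the same path for "the quick", so the common prefix is stored once.
    data Trie = Trie
        { count    :: !Int               -- occurrences of the n-gram ending here
        , children :: M.Map Word32 Trie  -- extensions by one more word ID
        }

    emptyTrie :: Trie
    emptyTrie = Trie 0 M.empty

    -- Record one occurrence of an n-gram, given as a list of word IDs.
    addNGram :: [Word32] -> Trie -> Trie
    addNGram []     (Trie c kids) = Trie (c + 1) kids
    addNGram (w:ws) (Trie c kids) =
        Trie c (M.insertWith (\_ old -> addNGram ws old) w (addNGram ws emptyTrie) kids)

    -- Count of an n-gram, 0 if it was never inserted.
    ngramCount :: [Word32] -> Trie -> Int
    ngramCount []     t = count t
    ngramCount (w:ws) t = maybe 0 (ngramCount ws) (M.lookup w (children t))

Every n-gram that shares a context shares the path for it, so each distinct prefix is stored exactly once; a serious implementation would flatten this into sorted arrays or some other compressed form, which is presumably how BerkeleyLM gets down to numbers like 10GB.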
>> For regular projects, that integerization would be enough, but for your task you'll probably want to spend some time tweaking the codes. In particular, you'll probably have enough word types to overflow the space of Int32/Word32 or even Int64/Word64.
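As an aside, to make that concrete for anyone following the thread, here's a bare-bones sketch of the interning table that "integerization" refers to. The choice of Data.HashMap.Strict, ByteString keys, and Word32 IDs here is arbitrary, and a real version would also keep the reverse ID-to-word array for decoding, which I've left out:

    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashMap.Strict as HM
    import Data.Word (Word32)

    data Interner = Interner
        { internTable :: !(HM.HashMap B.ByteString Word32)  -- word type -> ID
        , nextId      :: !Word32                            -- next unused ID
        }

    emptyInterner :: Interner
    emptyInterner = Interner HM.empty 0

    -- Look up a word's ID, allocating a fresh one the first time that type is seen.
    intern :: B.ByteString -> Interner -> (Word32, Interner)
    intern w st@(Interner tbl n) =
        case HM.lookup w tbl of
            Just i  -> (i, st)
            Nothing -> (n, Interner (HM.insert w n tbl) (n + 1))

    -- Integerize a whole sentence, threading the interner through.
    internSentence :: [B.ByteString] -> Interner -> ([Word32], Interner)
    internSentence []     st = ([], st)
    internSentence (w:ws) st =
        let (i,  st')  = intern w st
            (is, st'') = internSentence ws st'
        in (i : is, st'')

Something like internSentence (B.words line) emptyInterner then gives you the integerized sentence plus the updated table; since the IDs are Word32, this particular sketch tops out at 2^32 word types.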
Again, according to Pauls & Klein (2011), Google Web1T has about 13.5M word types, which easily fits into 24 bits (2^24 = 16,777,216). That's for English; morphologically rich languages will be different. I wouldn't expect too many problems for German unless you have a lot of technical text with a prodigious number of unique compound nouns. Even then I'd be surprised if you went over 2^64 (that would be reserved for languages like Japanese, Hungarian, Inuit, ... if even they'd ever get that bad).

--
Live well,
~wren