
On 5/22/11 8:40 AM, Aleksandar Dimitrov wrote:
>> If you have too much trouble trying to get SRILM to work, there's also the Berkeley LM which is easier to install. I'm not familiar with its inner workings, but it should offer pretty much the same sorts of operations.
> Do you know how BerkeleyLM compares to, say, MongoDB and PostgreSQL for large data sets? Maybe this is also the wrong list for this kind of question.
Well, BerkeleyLM is specifically for n-gram language modeling; it's not a general-purpose database. According to the paper I mentioned off-list, it can fit the entire Google Web1T corpus (approx. 1 trillion word tokens, 4 billion n-gram types) into 10GB of memory, which is far less than SRILM needs. Databases aren't really my area, so I can't give a good comparison. But at this scale you're going to want something specialized for storing n-grams rather than a general database: there's a lot of redundant structure in n-gram counts, and you'll want to take advantage of it.
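I can't speak to BerkeleyLM's internals, but to make the "redundant structure" point concrete, here's a bare-bones sketch of the kind of prefix sharing I mean. It assumes the words have already been integerized into Word32 IDs and uses a plain Data.Map trie, so it only illustrates the idea rather than being space-efficient itself:

    import qualified Data.Map.Strict as M
    import Data.Word (Word32)

    -- One node per shared context: "the quick brown" and "the quick fox"
    -- walk the same path for "the quick", so the common prefix is stored once.
    data Trie = Trie
        { count    :: !Int               -- occurrences of the n-gram ending here
        , children :: M.Map Word32 Trie  -- extensions by one more word ID
        }

    emptyTrie :: Trie
    emptyTrie = Trie 0 M.empty

    -- Record one occurrence of an n-gram, given as a list of word IDs.
    addNGram :: [Word32] -> Trie -> Trie
    addNGram []     (Trie c kids) = Trie (c + 1) kids
    addNGram (w:ws) (Trie c kids) =
        Trie c (M.insertWith (\_ old -> addNGram ws old) w (addNGram ws emptyTrie) kids)

    -- Count of an n-gram, 0 if it was never inserted.
    ngramCount :: [Word32] -> Trie -> Int
    ngramCount []     t = count t
    ngramCount (w:ws) t = maybe 0 (ngramCount ws) (M.lookup w (children t))

Every n-gram that shares a context shares the path for it, so each distinct prefix is stored exactly once; a serious implementation would flatten this into sorted arrays or some other compressed form, which is presumably how BerkeleyLM gets down to numbers like 10GB.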
>> For regular projects, that integerization would be enough, but for your task you'll probably want to spend some time tweaking the codes. In particular, you'll probably have enough word types to overflow the space of Int32/Word32 or even Int64/Word64.
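As an aside, to make that concrete for anyone following the thread, here's a bare-bones sketch of the interning table that "integerization" refers to. The choice of Data.HashMap.Strict, ByteString keys, and Word32 IDs here is arbitrary, and a real version would also keep the reverse ID-to-word array for decoding, which I've left out:

    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashMap.Strict as HM
    import Data.Word (Word32)

    data Interner = Interner
        { internTable :: !(HM.HashMap B.ByteString Word32)  -- word type -> ID
        , nextId      :: !Word32                            -- next unused ID
        }

    emptyInterner :: Interner
    emptyInterner = Interner HM.empty 0

    -- Look up a word's ID, allocating a fresh one the first time that type is seen.
    intern :: B.ByteString -> Interner -> (Word32, Interner)
    intern w st@(Interner tbl n) =
        case HM.lookup w tbl of
            Just i  -> (i, st)
            Nothing -> (n, Interner (HM.insert w n tbl) (n + 1))

    -- Integerize a whole sentence, threading the interner through.
    internSentence :: [B.ByteString] -> Interner -> ([Word32], Interner)
    internSentence []     st = ([], st)
    internSentence (w:ws) st =
        let (i,  st')  = intern w st
            (is, st'') = internSentence ws st'
        in (i : is, st'')

Something like internSentence (B.words line) emptyInterner then gives you the integerized sentence plus the updated table; since the IDs are Word32, this particular sketch tops out at 2^32 word types.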
Again, according to Pauls & Klein (2011), Google Web1T has about 13.5M word types, which easily fits into 24 bits (2^24 = 16,777,216). That's for English; morphologically rich languages will be different. I wouldn't expect too many problems for German unless you have a lot of technical text with a prodigious number of unique compound nouns. Even then I'd be surprised if you went over 2^64 (that would be reserved for languages like Japanese, Hungarian, Inuit, ... if even they'd ever get that bad).

--
Live well,
~wren