
Hi Haskellers,

I'm not aware of a "good method" or "default way" of handling large datasets that need non-sequential (i.e. random) access in Haskell. My use case is linguistic analysis of a ~30GB corpus. The most basic form of quantitative analysis here is ngram-based HMMs, which aren't difficult to pull off in Haskell. However, I would need a *large* database of ngrams, possibly running into the hundreds of gigabytes. Such a beast will obviously never fit into memory. Moreover, I'll need some sort of random access to it, both during training and during testing/operation.

My initial idea (the one I'm currently following up on) is to use Haskell's Berkeley DB wrapper. In that case, I would just stuff all ngrams into one big table (with their counts in an associated column) and let Berkeley DB do the heavy lifting. One obvious simplification is the flyweight pattern: keep a single table of unigrams (which would even fit into memory!) mapped to indices, and then reference only the indices in the ngram table. But that only shrinks the database by a constant factor, not by an order of magnitude.

Tries come to mind, too. Not only for the unigram table (where they indeed make a lot of sense), but also for the ngram table, where they might or might not help. In that case, the cells of the trie would be word indices pointing into the unigram table. Is there a way to serialize tries over a finite (but possibly large) domain to disk and still get semi-speedy random access to them? I'm fully aware that I'll have to wait up to a couple of seconds per lookup; that's OK.

I could hack together a primitive implementation of what I need here (I've appended a few rough sketches of what I have in mind as a P.S. below), but maybe there's something better out there already? Hackage didn't speak to me [1], and #haskell was busy discussing something else, so I hope -cafe can help me :-)

Thanks for any hints and pointers!

Aleks

[1] There's the ngram package on Hackage, but that only queries Google. I actually want to build my own ngram database, because I'm training on a specific corpus and will possibly have to adapt it to different domains.
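
P.S. A few rough sketches of what I have in mind, to make the question more concrete. First, the flyweight step: intern each word once, give it a dense Int index, and represent every ngram as a short list of indices. The encoded index list could then serve directly as the key in the big Berkeley DB table, with the encoded count as the value. This only uses containers and binary; all the names are mine, not from any existing wrapper:

import Data.List (mapAccumL)
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Lazy as BL
import Data.Binary (encode)

-- Unigram table: every distinct word gets a dense Int index.
data Lexicon = Lexicon
  { wordToIx :: M.Map String Int
  , nextIx   :: !Int
  }

emptyLexicon :: Lexicon
emptyLexicon = Lexicon M.empty 0

-- Look up a word's index, allocating a fresh one on first sight.
intern :: Lexicon -> String -> (Lexicon, Int)
intern lex@(Lexicon m n) w =
  case M.lookup w m of
    Just i  -> (lex, i)
    Nothing -> (Lexicon (M.insert w n m) (n + 1), n)

-- An ngram becomes a short list of indices instead of raw strings.
internNgram :: Lexicon -> [String] -> (Lexicon, [Int])
internNgram = mapAccumL intern

-- Serialized forms, usable as key/value pairs in any key-value
-- store, e.g. one big Berkeley DB table of ngram -> count.
ngramKey :: [Int] -> BL.ByteString
ngramKey = encode

countVal :: Int -> BL.ByteString
countVal = encode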
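
Second, the kind of trie over word indices I'm imagining for the ngram table, with Data.IntMap doing the branching and the counts stored at the nodes (again just a sketch, names made up):

import qualified Data.IntMap.Strict as IM

-- Each node carries the count of the ngram spelled out by the
-- path leading to it; children branch on the next word index.
data Trie = Trie
  { nodeCount :: !Int
  , nodeKids  :: IM.IntMap Trie
  }

emptyTrie :: Trie
emptyTrie = Trie 0 IM.empty

-- Bump the count of one ngram (a path of word indices).
addNgram :: [Int] -> Trie -> Trie
addNgram []     (Trie c ch) = Trie (c + 1) ch
addNgram (i:is) (Trie c ch) =
  Trie c (IM.insertWith (\_ old -> addNgram is old) i
                        (addNgram is emptyTrie) ch)

-- Count of an ngram; Nothing if no inserted ngram starts this way
-- (prefixes of longer ngrams report Just 0).
ngramCount :: [Int] -> Trie -> Maybe Int
ngramCount []     (Trie c _)  = Just c
ngramCount (i:is) (Trie _ ch) = IM.lookup i ch >>= ngramCount is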
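
And finally, the serialization question. One layout I could imagine (purely hypothetical, and exactly the part I'd rather not hack together myself): write the nodes post-order (children before parents), length-prefixed, with each parent recording (word index, file offset) pairs for its children. A lookup then follows the offsets with hSeek, touching one node per word:

import Data.Binary (encode, decode)
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as BL
import qualified Data.IntMap.Strict as IM
import System.IO

-- The in-memory trie from the previous sketch.
data Trie = Trie { nodeCount :: !Int, nodeKids :: IM.IntMap Trie }

-- On disk, a node is (count, [(word index, child offset)]),
-- length-prefixed so it can be read back without knowing its size.
type DiskNode = (Int, [(Int, Integer)])

writeNode :: Handle -> DiskNode -> IO Integer
writeNode h node = do
  off <- hTell h
  let bs = encode node
  BL.hPut h (encode (BL.length bs))   -- 8-byte Int64 length prefix
  BL.hPut h bs
  return off

readNode :: Handle -> Integer -> IO DiskNode
readNode h off = do
  hSeek h AbsoluteSeek off
  lenBs <- BL.hGet h 8
  bs <- BL.hGet h (fromIntegral (decode lenBs :: Int64))
  return (decode bs)

-- Post-order dump: children first, so each parent knows their
-- offsets. Returns the root's offset; store that in a header.
dumpTrie :: Handle -> Trie -> IO Integer
dumpTrie h (Trie c ch) = do
  kids <- mapM (\(i, t) -> do o <- dumpTrie h t; return (i, o))
               (IM.toList ch)
  writeNode h (c, kids)

-- One seek and one node read per word in the ngram.
diskCount :: Handle -> Integer -> [Int] -> IO (Maybe Int)
diskCount h off path = do
  (c, kids) <- readNode h off
  case path of
    []     -> return (Just c)
    (i:is) -> maybe (return Nothing) (\o -> diskCount h o is)
                    (lookup i kids)

You'd dump with a handle opened in WriteMode, remember the root offset, and query with a handle in ReadMode. No idea how this would compare to just letting Berkeley DB index the flat keys, which is part of why I'm asking.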