
On 10/9/10 2:02 AM, Alex Rozenshteyn wrote:
> This came up as I was doing homework for natural language processing.
> I'm constructing a trigram model from training data, but I also need the bigram and unigram counts. I could, for each triple of characters, add the 3-tuple to a trigram map and increment its count (I know I'm not actually mutating anything), add the last two letters to a bigram map, and add the last character to a unigram map.
> But this looks very much like a tree... each node contains the character (the key) and a list of characters that can appear after it. (I think this is a trie, but I don't know very much about tries.) The problem is that, if I use lists, lookup is more horribly inefficient than I can stand.
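[To make the question's setup concrete, here is a minimal sketch of the three-separate-count-maps approach it describes. The names (Counts, bump, trainCounts) and the use of Data.Map are illustrative assumptions, not something from the original message.]

    import qualified Data.Map as M

    -- Illustrative sketch: one count table per n-gram order.
    data Counts = Counts
        { unigrams :: M.Map Char Int
        , bigrams  :: M.Map (Char, Char) Int
        , trigrams :: M.Map (Char, Char, Char) Int
        }

    emptyCounts :: Counts
    emptyCounts = Counts M.empty M.empty M.empty

    -- Bump all three tables for one sliding window (x, y, z):
    -- the last character, the last two characters, and the full triple.
    bump :: (Char, Char, Char) -> Counts -> Counts
    bump (x, y, z) (Counts u b t) = Counts
        (M.insertWith (+) z 1 u)
        (M.insertWith (+) (y, z) 1 b)
        (M.insertWith (+) (x, y, z) 1 t)

    -- Fold over every consecutive character triple in the training data.
    trainCounts :: String -> Counts
    trainCounts s = foldr bump emptyCounts (zip3 s (drop 1 s) (drop 2 s))

[With trainCounts applied to the training text, each window updates the trigram, bigram, and unigram tables in a single pass.]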
Yep, that's a trie :)  For what it's worth, in the HMM tagger I'm working on now, I'm using something to the effect of:

    type IntTrie a = IntMap (a, IntTrie a)

Of course that's not a valid type synonym, and I'm not really using infinite depth (though there's no reason not to allow it), but that's the idea. The Ints come from interning the word/tag strings, and I actually do a lot of newtype wrapping to ensure that my IDs for words, tags, etc. and their maps are all kept distinct.

In order to update your counts during training you can use things like:

    succZ   z     = alter (Just . succ . maybe 0 id) z
    succYZ  y z   = alter (Just . (succ *** succZ  z)   . maybe (0, empty) id) y
    succXYZ x y z = alter (Just . (succ *** succYZ y z) . maybe (0, empty) id) x

And to look things up you can use things like:

    lookupZ   z     = maybe 0 id . lookup z
    lookupYZ  y z   = maybe 0 (lookupZ  z   . snd) . lookup y
    lookupXYZ x y z = maybe 0 (lookupYZ y z . snd) . lookup x

Clearly there's a lot of boilerplate there, but it works and it's very efficient. Ironically, combining the unigram, bigram, and trigram tables actually decreases performance (slightly) on my benchmarks, but I'm keeping them combined in order to reduce memory pressure.

You can also take a look at bytestring-trie [1], which is very similar and provides a nicer interface. Unfortunately, there's currently a big performance hit for using bytestring-trie, which I'm in the process of tracking down. Most of it is probably because it's Word8-based instead of Int-based, but I'd like to be sure. bytestring-trie does expose the trie-ness (more to come in the HEAD version).

Of course, for HMM stuff you don't really take advantage of the trie structure, so there's nothing special about using a trie instead of a Map NGram, where:

    data NGram
        = Unigram {-# UNPACK #-} !Int
        | Bigram  {-# UNPACK #-} !Int {-# UNPACK #-} !Int
        | Trigram {-# UNPACK #-} !Int {-# UNPACK #-} !Int {-# UNPACK #-} !Int
        deriving (Eq, Ord, Read, Show)

Expanding that out to three different IntMaps, or to the nested IntMaps above, is just a performance improvement, which I'm sure won't be important for your homework (i.e., it's not asymptotically significant).

[1] http://hackage.haskell.org/package/bytestring-trie

-- 
Live well,
~wren
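[For reference, the pieces above can be assembled into a self-contained, fixed-depth sketch. The concrete type aliases (Trie1, Trie2, Trie3) and the trainTrigrams driver are illustrative assumptions rather than anything from the post; they just spell out one way the nested-IntMap scheme fits together, with bare counts at the innermost level and (count, sub-map) pairs above it.]

    import           Control.Arrow ((***))
    import qualified Data.IntMap   as IM

    -- Outer two levels carry (count, sub-map); the innermost level carries
    -- bare counts, mirroring the sketch in the reply above.
    type Trie1 = IM.IntMap Int
    type Trie2 = IM.IntMap (Int, Trie1)
    type Trie3 = IM.IntMap (Int, Trie2)

    -- Increment the count for z in an innermost table.
    succZ :: Int -> Trie1 -> Trie1
    succZ z = IM.alter (Just . succ . maybe 0 id) z

    -- Increment the count at key y and the nested count for z.
    succYZ :: Int -> Int -> Trie2 -> Trie2
    succYZ y z = IM.alter (Just . (succ *** succZ z) . maybe (0, IM.empty) id) y

    -- Increment the count at key x and the nested (y, z) counts.
    succXYZ :: Int -> Int -> Int -> Trie3 -> Trie3
    succXYZ x y z = IM.alter (Just . (succ *** succYZ y z) . maybe (0, IM.empty) id) x

    lookupZ :: Int -> Trie1 -> Int
    lookupZ z = maybe 0 id . IM.lookup z

    lookupYZ :: Int -> Int -> Trie2 -> Int
    lookupYZ y z = maybe 0 (lookupZ z . snd) . IM.lookup y

    lookupXYZ :: Int -> Int -> Int -> Trie3 -> Int
    lookupXYZ x y z = maybe 0 (lookupYZ y z . snd) . IM.lookup x

    -- Count every consecutive triple of (already interned) token IDs.
    trainTrigrams :: [Int] -> Trie3
    trainTrigrams ws =
        foldr (\(x, y, z) -> succXYZ x y z) IM.empty
              (zip3 ws (drop 1 ws) (drop 2 ws))

[Given interned token IDs ws :: [Int], lookupXYZ x y z (trainTrigrams ws) returns the observed count of the trigram (x, y, z), with unseen n-grams defaulting to 0.]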