
(Full source attached; or alternately, grab it: darcs get http://community.haskell.org/~gwern/hcorpus ) So I have this little program which hopefully will help me learn French by picking, from a textfile/corpus the most 'valuable' word to learn - the word that will complete or bring closer to completion the most sentences in that corpus. It's a relatively straightforward problem, but efficiency is an issue. It can handle up to a book in length just fine, but at 10x that, it eats all my RAM and seems as if it will run indefinitely. (I haven't let it run long enough to find out.) The profiling fingers the method I would expect, one 'ranks' at ~80% of runtime, but it doesn't get any more specific than that, alas. The inner loop goes: fidiv :: (Integral a, Fractional b) => a -> a -> b fidiv = (/) `on` fromIntegral swap :: (a, b) -> (b, a) swap = uncurry (flip (,)) ranks :: (NFData v, Ord k, Ord v) => Map.Map k (Set.Set v) -> Maybe (Rational, v) ranks s = listToMaybe . sortBy (flip compare) . pmap swap . Map.toList . Map.fromListWith (+) $ [(word, 1 `fidiv` Set.size wrds) | (_sentenceId, wrds) <- Map.toList s , word <- Set.toList wrds] approximation :: (NFData v, Ord k, Ord v) => Map.Map k (Set.Set v) -> Int -> [v] approximation _ 0 = [] approximation s n = case ranks s of Nothing -> [] Just (_value, word) -> let withoutWord = Map.map (Set.delete word) s in word : approximation withoutWord (n-1) The inside of 'ranks' is generating a list which will look like [("foo", 0.2), ("bar", 0.0), ("foo", 0.5)...]; we insert into an empty Map, using the word as the key, and incrementing the existing value (so with that example we'd have {"foo" -> 0.7, "bar" -> 0.0}, and then we immediately convert *back* to a list, swap the tuple elements, and sort the whole blessed thing. I can't help but think that this is rather inefficient, and not the most straightforward way to go about it. But poring through Data.Map, I can't figure out what the right thing to do is. Obviously I only want to pull out the key/string with the maximum Rational value, but all the min/max functions work only on the *key*, not the value. Am I missing something, or am I using the wrong data structure, or what? -- gwern