Getting highest sum of list elements with Map

(Full source attached; or alternately, grab it: darcs get http://community.haskell.org/~gwern/hcorpus )

So I have this little program which will hopefully help me learn French by picking, from a textfile/corpus, the most 'valuable' word to learn - the word that will complete, or bring closer to completion, the most sentences in that corpus. It's a relatively straightforward problem, but efficiency is an issue. It can handle up to a book in length just fine, but at 10x that, it eats all my RAM and looks as if it will run indefinitely. (I haven't let it run long enough to find out.) The profiling fingers the method I would expect, 'ranks', at ~80% of runtime, but it doesn't get any more specific than that, alas. The inner loop goes:

    fidiv :: (Integral a, Fractional b) => a -> a -> b
    fidiv = (/) `on` fromIntegral

    swap :: (a, b) -> (b, a)
    swap = uncurry (flip (,))

    ranks :: (NFData v, Ord k, Ord v) => Map.Map k (Set.Set v) -> Maybe (Rational, v)
    ranks s = listToMaybe . sortBy (flip compare) . pmap swap . Map.toList
              . Map.fromListWith (+)
              $ [(word, 1 `fidiv` Set.size wrds)
                | (_sentenceId, wrds) <- Map.toList s
                , word <- Set.toList wrds]

    approximation :: (NFData v, Ord k, Ord v) => Map.Map k (Set.Set v) -> Int -> [v]
    approximation _ 0 = []
    approximation s n = case ranks s of
                          Nothing -> []
                          Just (_value, word) ->
                            let withoutWord = Map.map (Set.delete word) s
                            in word : approximation withoutWord (n-1)

The inside of 'ranks' generates a list which will look like [("foo", 0.2), ("bar", 0.0), ("foo", 0.5)...]; we insert into an empty Map, using the word as the key and summing into the existing value (so with that example we'd have {"foo" -> 0.7, "bar" -> 0.0}), and then we immediately convert *back* to a list, swap the tuple elements, and sort the whole blessed thing. I can't help but think that this is rather inefficient, and not the most straightforward way to go about it. But poring through Data.Map, I can't figure out what the right thing to do is.
Obviously I only want to pull out the key/string with the maximum Rational value, but all the min/max functions work only on the *key*, not the value. Am I missing something, or am I using the wrong data structure, or what? -- gwern
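[Editorial sketch: one standard answer to this question, not code from the thread. Data.Map's findMax only gives the maximal *key*, so finding the key with the maximal *value* is a linear scan over the association list; the helper name maxByValue is hypothetical.]

```haskell
import qualified Data.Map as Map
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Hypothetical helper: the entry whose associated value is largest.
-- Data.Map.findMax returns the largest key, so a max-by-value lookup
-- has to scan the assoc list (O(n)), e.g. with maximumBy on snd.
maxByValue :: Ord v => Map.Map k v -> Maybe (k, v)
maxByValue m
  | Map.null m = Nothing
  | otherwise  = Just (maximumBy (comparing snd) (Map.toList m))

main :: IO ()
main = print (maxByValue (Map.fromList [("foo", 0.7), ("bar", 0.0 :: Double)]))
```

This keeps the single Map but pays a full scan per extraction; the alternative data-structure answer would be to maintain a second Map keyed on the rank (a priority-queue-like arrangement), at the cost of updating both on every change.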

Hi Gwern,

gwern0 wrote:
...efficiency is an issue.
Here are a few efficiency issues that I noticed in your algorithm after a quick look (none of these have to do with Haskell really):

o ranks sorts the entire set, then discards all but the maximum. It would be better to use maximum or maximumBy. If you are worried about the empty set, check that separately.

o You recompute the same rank value repeatedly for every word in each sentence. Use a "let" clause in your list comprehension to do it only once per sentence.

o approximation rescans the entire corpus after discarding each word. Try to think of a way to recompute only those sentences that contained the word.

Hope this helps,
Yitz
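[Editorial sketch of the first two suggestions combined, assuming the same Map-of-Sets corpus as in the original post; with the swap dropped, the pair comes out as (word, rank) rather than (rank, word), and maximumBy breaks ties differently than the original sort did.]

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Sketch: maximumBy instead of sortBy-then-head, and the per-sentence
-- weight bound once with "let" instead of recomputed per word.
ranks :: (Ord k, Ord v) => Map.Map k (Set.Set v) -> Maybe (v, Rational)
ranks s
  | null pairs = Nothing   -- empty corpus (or all-empty sentences) checked separately
  | otherwise  = Just (maximumBy (comparing snd) (Map.toList tally))
  where
    tally = Map.fromListWith (+) pairs
    pairs = [ (word, weight)
            | (_sentenceId, wrds) <- Map.toList s
            , let weight = 1 / fromIntegral (Set.size wrds)
            , word <- Set.toList wrds ]

main :: IO ()
main = print (ranks (Map.fromList
         [ (1 :: Int, Set.fromList ["foo","bar"])
         , (2,        Set.fromList ["foo"]) ]))
```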

On Wed, Aug 5, 2009 at 6:53 AM, Yitzchak Gale
Hi Gwern,
gwern0 wrote:
...efficiency is an issue.
Here are a few efficiency issues that I noticed in your algorithm after a quick look (none of these have to do with Haskell really):
o ranks sorts the entire set, then discards all but the maximum. It would be better to use maximum or maximumBy. If you are worried about the empty set, check that separately.
OK, I've done that. I removed the swap call as well, and modified approximation to read:

    xs -> let word = fst . maximumBy (comparing snd) $ xs in

Removing the swap, and switching from sort to maximum, cuts about 2 or 3s off the 18-19s runtime on the text of _Dune_.
o You recompute the same rank value repeatedly for every word in each sentence. Use a "let" clause in your list comprehension to do it only once per sentence.
As in?

    [(word, rank) | (_sentenceId, wrds) <- Map.toList s
                  , let rank = 1 `fidiv` Set.size wrds
                  , word <- Set.toList wrds]

This didn't seem to make a noticeable difference.
o approximation rescans the entire corpus after discarding each word. Try to think of a way to recompute only those sentences that contained the word.
Hope this helps, Yitz
Not sure I follow. It seems you're saying that one can filter the corpus for matches, and then do the deletion step. If I fuse the filter and deletion passes into a single map, which looks like this:

    - let withoutWord = Map.map (Set.delete word) s
    + let withoutWord = Map.mapMaybe (\x -> if Set.member word x
    +                                       then Just (Set.delete word x)
    +                                       else Nothing) s

then I get very incorrect results (although the first one is still correct), which makes me think that the rescan isn't so wasted. -- gwern
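[Editorial note: the mapMaybe version is wrong because it discards whole sentences that merely lack the word, shrinking the corpus, whereas the suggestion was to keep all sentences but only *recompute* the affected ones. A hypothetical sketch of that idea, not code from the thread: carry the word tally alongside the corpus, and on removal adjust only sentences containing the word - each surviving word in such a sentence gains the difference 1/(n-1) - 1/n, and the removed word's entry is deleted.]

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set
import Data.List (maximumBy)
import Data.Ord (comparing)

type Corpus k v = Map.Map k (Set.Set v)
type Tally v    = Map.Map v Rational

weight :: Set.Set v -> Rational
weight ws = 1 / fromIntegral (Set.size ws)

-- Build the tally once, by a single pass over the whole corpus.
tally :: Ord v => Corpus k v -> Tally v
tally s = Map.fromListWith (+)
  [ (w, weight ws) | ws <- Map.elems s, w <- Set.toList ws ]

-- Remove one word, touching only the sentences that contain it.
removeWord :: (Ord k, Ord v) => v -> Corpus k v -> Tally v -> (Corpus k v, Tally v)
removeWord word s t = (Map.map (Set.delete word) s, t')
  where
    hits = [ ws | ws <- Map.elems s, Set.member word ws ]
    t'   = foldr bump (Map.delete word t) hits
    bump ws acc =
      let ws'   = Set.delete word ws
          delta = if Set.null ws' then 0 else weight ws' - weight ws
      in foldr (Map.adjust (+ delta)) acc (Set.toList ws')

approximation :: (Ord k, Ord v) => Corpus k v -> Int -> [v]
approximation s = go s (tally s)
  where
    go _ _ 0 = []
    go c t n
      | Map.null t = []
      | otherwise  =
          let (word, _) = maximumBy (comparing snd) (Map.toList t)
              (c', t')  = removeWord word c t
          in word : go c' t' (n - 1)

main :: IO ()
main = print (approximation (Map.fromList
         [ (1 :: Int, Set.fromList ["le","chat"])
         , (2,        Set.fromList ["le"])
         , (3,        Set.fromList ["le","chat","chien"]) ]) 3)
```

The per-step cost drops from rebuilding the whole tally to work proportional to the sentences that actually contained the removed word, at the price of keeping the tally and corpus in sync.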

On Wed, Aug 5, 2009 at 10:51 AM,
(Full source attached; or alternately, grab it: darcs get http://community.haskell.org/~gwern/hcorpus )
So I have this little program which hopefully will help me learn French by
Probably off-topic, but also, I'm willing to help anyone learning French, for example by answering questions in French, or about French, or just writing about anything in French, or translating texts for you. Bon courage, David.
participants (4)
- david48
- Gwern Branwen
- gwern0@gmail.com
- Yitzchak Gale