
On Wed, Aug 5, 2009 at 6:53 AM, Yitzchak Gale wrote:
> Hi Gwern,
> gwern0 wrote:
>> ...efficiency is an issue.
> Here are a few efficiency issues that I noticed in your algorithm after a quick look (none of these have to do with Haskell really):
> o ranks sorts the entire set, then discards all but the maximum. It would be better to use maximum or maximumBy. If you are worried about the empty set, check that separately.
OK, I've done that. I removed the swap call as well, and modified approximation to read:

    xs -> let word = fst . maximumBy (comparing snd) $ xs in

Removing the swap, and switching from sort to maximumBy, cuts about 2 or 3 seconds off the 18-19s runtime on the text of _Dune_.
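For context, the sort-vs-maximumBy change can be sketched in isolation. This is a minimal stand-alone version, assuming the rank list is a list of (word, score) pairs; the name topWord is hypothetical, not from the original program:

    import Data.List (maximumBy)
    import Data.Ord (comparing)

    -- Pick the highest-scoring word in a single O(n) pass,
    -- instead of sorting the whole list and taking the last element.
    topWord :: [(String, Double)] -> Maybe String
    topWord [] = Nothing  -- check the empty case separately, as suggested
    topWord xs = Just (fst (maximumBy (comparing snd) xs))

Using maximumBy (comparing snd) also removes the need to swap the pair around just to make the default Ord instance sort on the score.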
> o You recompute the same rank value repeatedly for every word in each sentence. Use a "let" clause in your list comprehension to do it only once per sentence.
As in?

    [(word, rank) | (_sentenceId, wrds) <- Map.toList s, let rank = 1 `fidiv` Set.size wrds, word <- Set.toList wrds]

This didn't seem to make a noticeable difference.
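Spelled out as a compilable unit, the shared "let" looks like this. The corpus type and the fidiv helper are assumptions reconstructed from the snippets in this thread (sentence id mapped to its set of words, fidiv doing fromIntegral division):

    import qualified Data.Map as Map
    import qualified Data.Set as Set

    type Corpus = Map.Map Int (Set.Set String)

    fidiv :: Int -> Int -> Double
    fidiv a b = fromIntegral a / fromIntegral b

    ranks :: Corpus -> [(String, Double)]
    ranks s = [ (word, rank)
              | (_sentenceId, wrds) <- Map.toList s
              , let rank = 1 `fidiv` Set.size wrds  -- computed once per sentence
              , word <- Set.toList wrds ]

The let binding sits between the two generators, so Set.size runs once per sentence rather than once per word; the small observed difference is plausible because Set.size is O(1) in containers anyway.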
> o approximation rescans the entire corpus after discarding each word. Try to think of a way to recompute only those sentences that contained the word.
> Hope this helps,
> Yitz
Not sure I follow. It seems you're saying that one can filter the corpus for matches, and then do the deletion step. If I fuse the filter and deletion passes into a single map, which looks like this:

    - let withoutWord = Map.map (Set.delete word) s
    + let withoutWord = Map.mapMaybe (\x -> if Set.member word x then Just (Set.delete word x) else Nothing) s

then I get very incorrect results (although the first one is still correct), which makes me think that the rescan isn't so wasted.

-- gwern
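One plausible reason for the divergence: Map.mapMaybe does not merely skip untouched sentences, it deletes from the map every key whose function result is Nothing, so the fused version throws away all sentences that do not contain the word. A small sketch of the two passes side by side (the Corpus type name is an assumption, and both function names are hypothetical):

    import qualified Data.Map as Map
    import qualified Data.Set as Set

    type Corpus = Map.Map Int (Set.Set String)

    -- Original pass: delete the word everywhere;
    -- sentences that never contained it pass through unchanged.
    deleteEverywhere :: String -> Corpus -> Corpus
    deleteEverywhere word = Map.map (Set.delete word)

    -- Fused pass from the diff above: sentences that do NOT contain
    -- the word map to Nothing, so Map.mapMaybe drops them entirely.
    fusedDelete :: String -> Corpus -> Corpus
    fusedDelete word = Map.mapMaybe (\x -> if Set.member word x
                                             then Just (Set.delete word x)
                                             else Nothing)

On a corpus where only sentence 1 contains the word, deleteEverywhere keeps both sentences while fusedDelete keeps only sentence 1, so every later ranking pass sees a shrunken corpus and produces different words. That would explain the results being wrong from the second word on, independently of whether the rescan itself is wasted.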