
On Wednesday, 13 January 2010 at 03:38:46, Tim Perry wrote:
I compiled the original version, Yusaku's version, and a version I wrote, and found the following:
$ time ./Anagram_me < /usr/share/dict/words > /dev/null

real    0m2.197s
user    0m2.040s
sys     0m0.160s

$ time ./Anagram_JoeVanDyke < /usr/share/dict/words > /dev/null

real    0m4.570s
user    0m4.290s
sys     0m0.260s

perry@emperor:~/haskell$ time ./Anagram_Yusaku < /usr/share/dict/words > /dev/null

real    0m1.337s
user    0m1.230s
sys     0m0.100s
From this, it looks like my version takes less than half the time of the original. However, if I run it on a bigger dictionary (the Ubuntu package wamerican-large instead of wamerican-small), I'm only about 30% faster than the original. This makes me think I have some sort of exponential data structure growth going on. Here is my version. Can anyone confirm that data structure growth is the problem with my approach? Thanks, Tim
import Data.List as Lst
import Data.Map as Map
-- This version only displays words that have more than two
-- matches in the list, and sorts by the words that got the most matches.
-- Can we do the map bit better?
main = do
    input <- getContents
    print $ anagrams $ lines input
anagrams words = sorted_anagrams
  where
    sorted_anagrams = sortBy sorter filtered_anagrams
    sorter a b = compare (length b) (length a)
    longEnoughWords = [x | x <- words, length x > 1]
The words are short here, so it's not catastrophic, but don't use length unless you really want to know the length. Here, use (not . null . drop 1) instead; an input line might be empty, so don't use tail. In general, instead of (length list > k), use (not . null $ drop k list). And if you want to check (length list == k), then

    case drop (k - 1) list of
      [_] -> True
      _   -> False

is O(min k (length list)); if there is any chance that the list is much longer than k, that is safer. However, it's unlikely that there are more than 52 one-letter words in the word list, so filtering those out shouldn't make it faster anyway.
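To make that concrete, here is a sketch of those comparisons packaged as small helpers (the names longerThan and hasLength are made up for illustration; they are not in Tim's code):

    -- True iff length list > k; inspects at most k+1 cons cells,
    -- so it works even on very long (or infinite) lists.
    longerThan :: Int -> [a] -> Bool
    longerThan k = not . null . drop k

    -- True iff length list == k (for k >= 1), in O(min k (length list)).
    hasLength :: Int -> [a] -> Bool
    hasLength k list =
        case drop (k - 1) list of
          [_] -> True
          _   -> False

    -- the filter from the program above would then read:
    -- longEnoughWords = [x | x <- words, longerThan 1 x]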
    filtered_anagrams = [x | x <- Map.elems $ foldr insert empty $ wordPairs, length x > 2]
      where
That's bad. Using foldr to construct the map, you must hold the entire input list in memory at once (since the list takes less memory than the map, that is not a real problem: if you ran out of memory that way, you would run out anyway), and you can start constructing the map only after the entire input has been read. That is the real problem: you build one huge thunk, which may blow the stack, and it's slow. foldr is for the cases where you can start returning output before the entire list has been consumed; a necessary condition for that is that the accumulation function is lazy in its second argument, like (++), (&&) and (||). In practically all other cases, you want foldl'. (There might be a few cases where plain foldl is what you want; I haven't seen such a case yet, though.)
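The effect is easy to reproduce in isolation. A minimal sketch, using (+), which is strict in both arguments, as the accumulating function:

    import Data.List (foldl')

    -- foldr must build the nested expression 1 + (2 + (3 + ...))
    -- before any addition can happen, because (+) cannot return
    -- until both arguments are known; with a small default stack
    -- limit this is likely to overflow on large inputs.
    sumR :: Integer
    sumR = foldr (+) 0 [1 .. 1000000]

    -- foldl' forces the accumulator at every step, so it runs in
    -- constant stack space.
    sumL :: Integer
    sumL = foldl' (+) 0 [1 .. 1000000]

    -- foldr is the right choice when the function is lazy in its
    -- second argument and output can be produced incrementally:
    joinLines :: [String] -> String
    joinLines = foldr (\x acc -> x ++ "\n" ++ acc) ""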
        wordPairs = zip (Prelude.map Lst.sort longEnoughWords) longEnoughWords
        insert (sorted, original) = insertWith (++) sorted [original]
Changing foldr to foldl' and insert to

    insert m (sorted, original) = insertWith (++) sorted [original] m

reduces the time on my /usr/share/dict/words from 31 seconds to 8. The difference (foldl' vs. foldr) is 0.64s vs. 0.76s for 40,000 words, 1.17s vs. 1.80s for 70,000 words, and 2.12s vs. 4.22s for 120,000 words.
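Spelled out, the fixed definition might look like this (a sketch of the change just described, reusing the imports and the surrounding where-clause of Tim's anagrams function; Lst.foldl' is qualified to avoid any ambiguity with Data.Map):

    -- foldl' consumes the word list from the left and forces the
    -- map at every step, so no giant thunk builds up.
    filtered_anagrams =
        [x | x <- Map.elems $ Lst.foldl' insert empty wordPairs, length x > 2]
      where
        wordPairs = zip (Prelude.map Lst.sort longEnoughWords) longEnoughWords
        -- the accumulator (the map) now comes first, as foldl' expects
        insert m (sorted, original) = insertWith (++) sorted [original] m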