
On Wednesday, 13 January 2010 at 03:38:46, Tim Perry wrote:
I compiled the original version, Yusaku's version, and a version I wrote, and found the following:
$ time ./Anagram_me < /usr/share/dict/words > /dev/null

real    0m2.197s
user    0m2.040s
sys     0m0.160s

$ time ./Anagram_JoeVanDyke < /usr/share/dict/words > /dev/null

real    0m4.570s
user    0m4.290s
sys     0m0.260s

perry@emperor:~/haskell$ time ./Anagram_Yusaku < /usr/share/dict/words > /dev/null

real    0m1.337s
user    0m1.230s
sys     0m0.100s
From this, it looks like my version takes less than half the time of the original. However, if I run it on a bigger dictionary (the Ubuntu package wamerican-large instead of wamerican-small), I'm only about 30% faster than the original. This makes me think I have some sort of exponential data structure growth going on. Here is my version. Can anyone confirm that data structure growth is the problem with my approach? Thanks, Tim
import Data.List as Lst
import Data.Map as Map
-- This version only displays words that have more than two
-- matches in the list, and sorts by the words that got the most matches.
-- Can we do the map bit better?
main = do
    input <- getContents
    print $ anagrams $ lines input
anagrams words = sorted_anagrams
  where
    sorted_anagrams = sortBy sorter filtered_anagrams
    sorter a b = compare (length b) (length a)
    longEnoughWords = [x | x <- words, length x > 1]
The words are short here, so it's not catastrophic, but don't use length unless you really want to know the length. Here, use (not . null . drop 1) instead; an input line might be empty, so don't use tail. In general, instead of (length list > k), use (not . null $ drop k list). And if you want to check (length list == k), then

    case drop (k - 1) list of
      [_] -> True
      _   -> False

is O(min k (length list)); if there is any chance that the list is much longer than k, that is safer. However, it's unlikely that there are more than 52 one-letter words in the word list, so filtering those out shouldn't make it faster anyway.
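To make that concrete, here is a sketch of those comparisons packaged as small helpers (the names longerThan and hasLength are made up for illustration; they are not in Tim's code):

    -- True iff length list > k; inspects at most k+1 cons cells,
    -- so it works even on very long (or infinite) lists.
    longerThan :: Int -> [a] -> Bool
    longerThan k = not . null . drop k

    -- True iff length list == k (for k >= 1), in O(min k (length list)).
    hasLength :: Int -> [a] -> Bool
    hasLength k list =
        case drop (k - 1) list of
          [_] -> True
          _   -> False

    -- the filter from the program above would then read:
    -- longEnoughWords = [x | x <- words, longerThan 1 x]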
    filtered_anagrams = [x | x <- Map.elems $ foldr insert empty $ wordPairs, length x > 2]
      where
That's bad. Using foldr to construct the map, you must hold the entire input list in memory at once (since the list takes less memory than the map, that is not a real problem: if you ran out of memory that way, you would run out anyway), and you can start constructing the map only after the entire input has been read. That is the real problem: you build one huge thunk, which may blow the stack, and it's slow. foldr is for the cases where you can start returning output before the entire list has been consumed; a necessary condition for that is that the accumulation function is lazy in its second argument, like (++), (&&) and (||). In practically all other cases, you want foldl'. (There might be a few cases where plain foldl is what you want; I haven't seen such a case yet, though.)
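The effect is easy to reproduce in isolation. A minimal sketch, using (+), which is strict in both arguments, as the accumulating function:

    import Data.List (foldl')

    -- foldr must build the nested expression 1 + (2 + (3 + ...))
    -- before any addition can happen, because (+) cannot return
    -- until both arguments are known; with a small default stack
    -- limit this is likely to overflow on large inputs.
    sumR :: Integer
    sumR = foldr (+) 0 [1 .. 1000000]

    -- foldl' forces the accumulator at every step, so it runs in
    -- constant stack space.
    sumL :: Integer
    sumL = foldl' (+) 0 [1 .. 1000000]

    -- foldr is the right choice when the function is lazy in its
    -- second argument and output can be produced incrementally:
    joinLines :: [String] -> String
    joinLines = foldr (\x acc -> x ++ "\n" ++ acc) ""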
        wordPairs = zip (Prelude.map Lst.sort longEnoughWords) longEnoughWords
        insert (sorted, original) = insertWith (++) sorted [original]
Changing foldr to foldl' and insert to

    insert m (sorted, original) = insertWith (++) sorted [original] m

reduces the time on my /usr/share/dict/words from 31 seconds to 8. The difference (foldl' vs. foldr) is 0.64s vs. 0.76s for 40,000 words, 1.17s vs. 1.80s for 70,000 words, and 2.12s vs. 4.22s for 120,000 words.
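Spelled out, the fixed definition might look like this (a sketch of the change just described, reusing the imports and the surrounding where-clause of Tim's anagrams function; Lst.foldl' is qualified to avoid any ambiguity with Data.Map):

    -- foldl' consumes the word list from the left and forces the
    -- map at every step, so no giant thunk builds up.
    filtered_anagrams =
        [x | x <- Map.elems $ Lst.foldl' insert empty wordPairs, length x > 2]
      where
        wordPairs = zip (Prelude.map Lst.sort longEnoughWords) longEnoughWords
        -- the accumulator (the map) now comes first, as foldl' expects
        insert m (sorted, original) = insertWith (++) sorted [original] m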