Re: [Haskell-beginners] Code help requested

12 Jan 2010

      Am Dienstag 12 Januar 2010 02:22:49 schrieb Joe Van Dyk:
...
I've written two versions of the same program, one in ruby and one in
haskell.  Given words on stdin, find all the anagrams in those words.
For nicer display, we're only going to display the top 3 results.
I'm obviously new to haskell.  The ruby version runs about 5x as fast
on a large file.   How can I improve the haskell version?
http://gist.github.com/274774
# Ruby version
input = STDIN.read.split("\n")
result = Hash.new([])
input.each do |word|
  sorted_word = word.split('').sort.join
  result[sorted_word] += [word]
end
values = result.values.sort { |a, b| b.size <=> a.size }
p values[0..3]
# Haskell version
import List
import qualified Data.Map as Map
-- Given as stdin
-- presents
-- serpents
-- no
-- on
-- whatever
-- Expected Output:
-- [["serpents","presents"],["on","no"]]
-- This version only displays words that have more than one
-- match in the list, and sorts by the words that got the most matches.
-- Can we do the map bit better?
main = do
  input <- getContents
  print $ anagrams $ lines input
anagrams words =
  sorted_anagrams
  where
    sorted_anagrams = sortBy sorter filtered_anagrams
    sorter a b = compare (length b)  (length a)
    filtered_anagrams = Map.elems $ Map.filter filter_function
all_anagrams filter_function words = length words > 1
    all_anagrams = do_anagrams words Map.empty
    do_anagrams [] result = result
    do_anagrams words result = do_anagrams
                                 (tail words)
Here be dragons.
unionWith is O(n+m) where n and m are the sizes of the two maps, the insert 
variants are O(log n). So this may be quadratic [actually, if the second 
map in unionWith is a singleton, it behaves better, but it's still much 
slower than inserts] (and very lazy, which means that without 
optimisations, all the unions form a giant thunk which overflows the stack 
for large enough input), using

all_anagrams 
    = foldl' (\m w -> Map.insertWith' (++) (sort w) [w] m) Map.empty words

, you get an O(n*log n) algorithm with sufficient strictness to not blow 
the stack. For

$ wc -l /usr/share/dict/words
380645 /usr/share/dict/words

that is a heck of a difference.
...
(Map.unionWith
                                   (++)
Don't use Map.fromList [(key,value)], use Map.singleton key value instead.
...
(Map.fromList
[(sorted_current_word, [current_word])])
                                   result)
      where
        current_word = head words
        sorted_current_word = sort current_word
While the original version got a stack overflow without optimisations, it 
ran with -O2, but took a *lot* of memory and was ~10% slower than the Ruby 
version. But it spent 68% of the time garbage collecting.

Change all_anagrams as above, and it uses reasonable memory (about 30% more 
than Ruby if left to choose how much to use, it can run on less than Ruby 
with +RTS -MxM, but that of course increases GC times a bit) and takes 
about a third of the time of the Ruby version (*without optimisations*, -O2 
makes only a small difference [~6%] here).

$ time ruby ./Anagrams.rb < /usr/share/dict/words > /dev/null
26.27user 0.21system 0:26.48elapsed 100%CPU

$ ./AnagramsH +RTS -sstderr < /usr/share/dict/words  > /dev/null
./AnagramsH +RTS -sstderr
   1,807,965,184 bytes allocated in the heap
     577,083,904 bytes copied during GC
      73,277,232 bytes maximum residency (12 sample(s))
         858,644 bytes maximum slop
             166 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:  3437 collections,     0 parallel,  2.47s,  2.62s elapsed
  Generation 1:    12 collections,     0 parallel,  1.07s,  1.25s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    5.16s  (  5.19s elapsed)
  GC    time    3.54s  (  3.86s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.70s  (  9.05s elapsed)

  %GC time      40.7%  (42.7% elapsed)

  Alloc rate    350,629,823 bytes per MUT second

  Productivity  59.3% of total user, 57.0% of total elapsed

Re: [Haskell-beginners] Code help requested

Daniel Fischer