
Am Dienstag 12 Januar 2010 02:22:49 schrieb Joe Van Dyk:
I've written two versions of the same program, one in ruby and one in haskell. Given words on stdin, find all the anagrams in those words. For nicer display, we're only going to display the top 3 results.
I'm obviously new to haskell. The ruby version runs about 5x as fast on a large file. How can I improve the haskell version?
# Ruby version input = STDIN.read.split("\n") result = Hash.new([]) input.each do |word| sorted_word = word.split('').sort.join result[sorted_word] += [word] end values = result.values.sort { |a, b| b.size <=> a.size } p values[0..3]
# Haskell version import List import qualified Data.Map as Map
-- Given as stdin -- presents -- serpents -- no -- on -- whatever -- Expected Output: -- [["serpents","presents"],["on","no"]]
-- This version only displays words that have more than one -- match in the list, and sorts by the words that got the most matches.
-- Can we do the map bit better?
main = do input <- getContents print $ anagrams $ lines input
anagrams words = sorted_anagrams where sorted_anagrams = sortBy sorter filtered_anagrams sorter a b = compare (length b) (length a) filtered_anagrams = Map.elems $ Map.filter filter_function all_anagrams filter_function words = length words > 1 all_anagrams = do_anagrams words Map.empty do_anagrams [] result = result do_anagrams words result = do_anagrams (tail words)
Here be dragons. unionWith is O(n+m) where n and m are the sizes of the two maps, the insert variants are O(log n). So this may be quadratic [actually, if the second map in unionWith is a singleton, it behaves better, but it's still much slower than inserts] (and very lazy, which means that without optimisations, all the unions form a giant thunk which overflows the stack for large enough input), using all_anagrams = foldl' (\m w -> Map.insertWith' (++) (sort w) [w] m) Map.empty words , you get an O(n*log n) algorithm with sufficient strictness to not blow the stack. For $ wc -l /usr/share/dict/words 380645 /usr/share/dict/words that is a heck of a difference.
(Map.unionWith (++)
Don't use Map.fromList [(key,value)], use Map.singleton key value instead.
(Map.fromList [(sorted_current_word, [current_word])]) result) where current_word = head words sorted_current_word = sort current_word
While the original version got a stack overflow without optimisations, it ran with -O2, but took a *lot* of memory and was ~10% slower than the Ruby version. But it spent 68% of the time garbage collecting. Change all_anagrams as above, and it uses reasonable memory (about 30% more than Ruby if left to choose how much to use, it can run on less than Ruby with +RTS -MxM, but that of course increases GC times a bit) and takes about a third of the time of the Ruby version (*without optimisations*, -O2 makes only a small difference [~6%] here). $ time ruby ./Anagrams.rb < /usr/share/dict/words > /dev/null 26.27user 0.21system 0:26.48elapsed 100%CPU $ ./AnagramsH +RTS -sstderr < /usr/share/dict/words > /dev/null ./AnagramsH +RTS -sstderr 1,807,965,184 bytes allocated in the heap 577,083,904 bytes copied during GC 73,277,232 bytes maximum residency (12 sample(s)) 858,644 bytes maximum slop 166 MB total memory in use (1 MB lost due to fragmentation) Generation 0: 3437 collections, 0 parallel, 2.47s, 2.62s elapsed Generation 1: 12 collections, 0 parallel, 1.07s, 1.25s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 5.16s ( 5.19s elapsed) GC time 3.54s ( 3.86s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 8.70s ( 9.05s elapsed) %GC time 40.7% (42.7% elapsed) Alloc rate 350,629,823 bytes per MUT second Productivity 59.3% of total user, 57.0% of total elapsed