
On January 11, 2010 5:22:49 pm Joe Van Dyk wrote:
I've written two versions of the same program, one in ruby and one in haskell. Given words on stdin, find all the anagrams in those words. For nicer display, we're only going to display the top 3 results.
I'm obviously new to haskell. The ruby version runs about 5x as fast on a large file. How can I improve the haskell version?
# Ruby version input = STDIN.read.split("\n") result = Hash.new([]) input.each do |word| sorted_word = word.split('').sort.join result[sorted_word] += [word] end values = result.values.sort { |a, b| b.size <=> a.size } p values[0..3]
# Haskell version import List import qualified Data.Map as Map
-- Given as stdin -- presents -- serpents -- no -- on -- whatever -- Expected Output: -- [["serpents","presents"],["on","no"]]
-- This version only displays words that have more than one -- match in the list, and sorts by the words that got the most matches.
-- Can we do the map bit better?
main = do input <- getContents print $ anagrams $ lines input
anagrams words = sorted_anagrams where sorted_anagrams = sortBy sorter filtered_anagrams sorter a b = compare (length b) (length a) filtered_anagrams = Map.elems $ Map.filter filter_function all_anagrams filter_function words = length words > 1 all_anagrams = do_anagrams words Map.empty do_anagrams [] result = result do_anagrams words result = do_anagrams (tail words) (Map.unionWith (++) (Map.fromList [(sorted_current_word, [current_word])]) result) where current_word = head words sorted_current_word = sort current_word
These are the numbers I got once I modified your Haskell program to only print out 4 results the way the Ruby program does. Your Haskell version: ~10.0 s My Haskell version: ~2.5 s Your Ruby version (Ruby 1.8): ~4.6 s Your Ruby version (Ruby 1.9): ~4.2 s This is my version of your program: import Control.Monad (liftM) import Data.Function (on) import Data.List (sort, sortBy) import qualified Data.ByteString.Char8 as B import qualified Data.Map as Map -- Given as stdin -- presents -- serpents -- no -- on -- whatever -- Expected Output: -- [["serpents","presents"],["on","no"]] main = do input <- liftM B.lines B.getContents let wordMap = buildMap $ map B.unpack input print $ take 4 (listAnagrams wordMap) buildMap words = let entries = map (\x -> (sort x, [x])) words in Map.fromListWith (++) entries listAnagrams wordMap = let anagrams = (Map.elems . Map.filter (\x -> length x > 1)) wordMap in sortBy (flip (compare `on` length)) anagrams I found that the performance improved when I used ByteStrings to read the input and then unpacked to regular strings before creating the Map. For some reason, using BytesStrings everywhere made the program slower. Can anyone tell me why? Dave