
I compiled the original version, Yusaka's version, and a version I wrote and found the following:
$ time ./Anagram_me < /usr/share/dict/words > /dev/null
real 0m2.197s
user 0m2.040s
sys 0m0.160s
$ time ./Anagram_JoeVanDyke < /usr/share/dict/words > /dev/null
real 0m4.570s
user 0m4.290s
sys 0m0.260s
perry@emperor:~/haskell$ time ./Anagram_Yusaku < /usr/share/dict/words > /dev/null
real 0m1.337s
user 0m1.230s
sys 0m0.100s
From
this, it looks like mine version takes less than half the time of the
original. However, if I run a bigger dictionary (Ubuntu package
wamerican-large instead of wamerican-small) then I'm only about 30%
faster than the original. This makes me think I have some sort of
exponential data structure growth going on. Here is my version. Can anyone confirm that data
structure growth is the problem with my approach? Thanks, Tim
import Data.List as Lst
import Data.Map as Map
-- This version only displays words that have more than two
-- match in the list, and sorts by the words that got the most matches.
-- Can we do the map bit better?
main = do
input <- getContents
print $ anagrams $ lines input
anagrams words =
sorted_anagrams
where
sorted_anagrams = sortBy sorter filtered_anagrams
sorter a b = compare (length b) (length a)
longEnoughWords = [x | x <- words, length x > 1]
filtered_anagrams = [x | x <- Map.elems $ foldr insert empty $ wordPairs, length x > 2]
where
wordPairs = zip (Prelude.map Lst.sort longEnoughWords) longEnoughWords
insert (sorted, original) = insertWith (++) sorted [original]
----- Original Message ----
From: Daniel Fischer
These are the numbers I got once I modified your Haskell program to only print out 4 results the way the Ruby program does.
Your Haskell version: ~10.0 s My Haskell version: ~2.5 s Your Ruby version (Ruby 1.8): ~4.6 s Your Ruby version (Ruby 1.9): ~4.2 s
I sincerely hope your input file is smaller than mine :)
This is my version of your program:
import Control.Monad (liftM) import Data.Function (on) import Data.List (sort, sortBy) import qualified Data.ByteString.Char8 as B import qualified Data.Map as Map
-- Given as stdin -- presents -- serpents -- no -- on -- whatever -- Expected Output: -- [["serpents","presents"],["on","no"]]
main = do input <- liftM B.lines B.getContents let wordMap = buildMap $ map B.unpack input print $ take 4 (listAnagrams wordMap)
buildMap words = let entries = map (\x -> (sort x, [x])) words in Map.fromListWith (++) entries
listAnagrams wordMap = let anagrams = (Map.elems . Map.filter (\x -> length x > 1)) wordMap in sortBy (flip (compare `on` length)) anagrams
I found that the performance improved when I used ByteStrings to read the input and then unpacked to regular strings before creating the Map. For some reason, using BytesStrings everywhere made the program slower. Can anyone tell me why?
Yes. ByteString's sort is a bucket-sort. It allocates an array of 256*sizeof(CSize) bytes and counts the occurrences of each character. That's fine for long ByteStrings, but for short ByteStrings like those we consider here, allocating a bucket-array of 1K or 2K is incredibly much. Sorting plain Strings is faster (not very much, though) and uses (much) less memory if they are short. You can further speed up your programme if you put lists of ByteStrings in your Map (less memory, less GC) and unpack them only for sorting (and finally for output): main = do input <- liftM B.lines B.getContents let wordMap = buildMap input print $ take 4 (listAnagrams wordMap) buildMap words = let entries = map (\x -> (B.pack . sort $ B.unpack x, [x])) words in Map.fromListWith (++) entries listAnagrams wordMap = let anagrams = (Map.elems . Map.filter (\x -> length x > 1)) wordMap -- a small speedup can be obtained by not using length: -- Map.filter (not . null . drop 1) -- or Map.filter (\l -> case l of { (_:_:_) -> True; _ -> False }) -- if there are many long lists in the map, the speedup will become -- significant in map (map B.unpack) $ sortBy (flip (compare `on` length)) anagrams Yours: $ ./DFAnagrams +RTS -sstderr < /usr/share/dict/words > /dev/null ./DFAnagrams +RTS -sstderr 1,218,862,708 bytes allocated in the heap 544,113,420 bytes copied during GC 98,018,856 bytes maximum residency (10 sample(s)) 768,552 bytes maximum slop 211 MB total memory in use (2 MB lost due to fragmentation) Generation 0: 2315 collections, 0 parallel, 1.98s, 2.01s elapsed Generation 1: 10 collections, 0 parallel, 1.28s, 1.53s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 3.60s ( 3.61s elapsed) GC time 3.26s ( 3.54s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 6.86s ( 7.15s elapsed) %GC time 47.5% (49.4% elapsed) Alloc rate 338,176,931 bytes per MUT second Productivity 52.4% of total user, 50.3% of total elapsed Modified: $ ./DFBAnagrams +RTS -sstderr < /usr/share/dict/words > /dev/null ./DFBTAnagrams +RTS -sstderr 1,108,946,552 bytes allocated in the heap 237,869,304 bytes copied during GC 41,907,844 bytes maximum residency (10 sample(s)) 4,374,152 bytes maximum slop 89 MB total memory in use (1 MB lost due to fragmentation) Generation 0: 2091 collections, 0 parallel, 1.14s, 1.19s elapsed Generation 1: 10 collections, 0 parallel, 0.40s, 0.50s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 2.90s ( 2.90s elapsed) GC time 1.54s ( 1.69s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.44s ( 4.59s elapsed) %GC time 34.7% (36.8% elapsed) Alloc rate 382,369,915 bytes per MUT second Productivity 65.2% of total user, 63.1% of total elapsed versus 26.25s for the Ruby version (ruby 1.8.7). Yay!
Dave
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners