
All, In order to teach myself Haskell, I've been tinkering with some of the Shootout (http://shootout.alioth.debian.org/great/) programs. Substantially improved the Mandelbrot program. Then started to work on the Spellcheck program, since Haskell seemed to do quite poorly at it. However, my revision appears to be slower than the original. I was hoping that I could get some help from y'all. Original program: import Data.Set main = do d <- readFile "Usr.Dict.Words" let misspelled x = not $ x `elementOf` (mkSet (lines d)) interact $ unlines . filter misspelled . Lines The original program is nice and short, but uses Set to hold the 80,000 word dictionary. The Input file is 80,000 words, too, so I figured that searching the Set dictionary would be an O(n^2) task (80,000 spellchecks x linear search on 80,000 word dictionary Set). All of the other languages used hash tables for this task (O(n) (80,000 spellchecks x hash search on dictionary), so I revised the program to do the same. ---------------------------------------- Revised program: import Data.HashTable import IO import GHC.Int buildHash :: (String -> Int32) -> String -> IO (HashTable String Bool) buildHash hash_fn list = do table <- new (==) hash_fn sequence_ [ insert table k True | k <- lines list ] return table spellcheck :: HashTable String Bool -> String -> String -> IO String spellcheck ds "" accum = return accum spellcheck ds xs accum = let (x, xs') = break (== '\n') xs in do r <- Data.HashTable.lookup ds x case r of Just b -> spellcheck ds (tail xs') accum _ -> spellcheck ds (tail xs') (x ++ accum) main = do ds <- readFile "Usr.Dict.Words" >>= buildHash hashString hGetContents stdin >>= (\xs -> spellcheck ds xs "") >>= print ---------------------------------------- The Revised program is 33% faster than the Original program using the original dataset. However, if I boost the Input to 1.4M words while keeping the dictionary the same, then the two programs are equal in speed. This suggests that loading the dictionary and building the Hash/Set is faster in the Revised program, but that _lookups_ are _slower_ in the Revised program. Is Hashtable.lookup really slower than Set.elementOf? I'm missing something and I don't know what it is. Frustrating. - Alson

Hi Alson! On Thu, Feb 24, 2005 at 07:40:49AM -0500, Alson Kemp wrote:
All, In order to teach myself Haskell, I've been tinkering with some of the Shootout (http://shootout.alioth.debian.org/great/) programs.
Good idea.
Substantially improved the Mandelbrot program. Then started to work on the Spellcheck program, since Haskell seemed to do quite poorly at it.
Welcome to the club :-) http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg06012.html
Original program: import Data.Set main = do d <- readFile "Usr.Dict.Words" let misspelled x = not $ x `elementOf` (mkSet (lines d)) interact $ unlines . filter misspelled . Lines
The original program is nice and short, but uses Set to hold the 80,000 word dictionary. The Input file is 80,000 words, too, so I figured that searching the Set dictionary would be an O(n^2) task (80,000 spellchecks x linear search on 80,000 word dictionary Set).
It's O(n*log n), since a lookup in Data.Set is O(log n). The O(n^2) version would be main = do d <- readFile "Usr.Dict.Words" interact $ unlines . filter (not . (`elem` (lines d))) . Lines :-) The real performance problem propbably is not O(n) vs. O(n*log n) but slow String operations in general. After all, String is a lazy list of Char. Have a look at the thread mentioned above, possibly Simon's version has become fast again, otherwise it would be an opportuniy to reconsider the problems mentioned there. Take care, Carsten -- Carsten Schultz (2:38, 33:47) http://carsten.codimi.de/ PGP/GPG key on the pgp.net key servers, fingerprint on my home page.
participants (2)
-
Alson Kemp
-
Carsten Schultz