
Hi Bane, nice algorithm. Since comparing chars _is_ cheap, it is to be expected that all the hash-rotating is far more costly for short search patterns. The longer the pattern, the better this gets, I think -- though nowhere near KMP (or would it?). However, I don't see how to (efficiently) do a multiple pattern search with KMP, so there -- if all patterns have the same length, otherwise I don't see -- Rabin-Karp would probably be the method of choice. Am Mittwoch, 14. Dezember 2005 10:16 schrieben Sie:
From: Daniel Fischer
To: "Branimir Maksimovic"
CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Tue, 13 Dec 2005 11:23:29 +0100 After seeing that your program is fastest (I've also tried one from http://haskell.org/hawiki/RunTimeCompilation but perhaps I'm not that good in converting to search replace?) I've decided to try with Rabin-Karp algorithm. This algorithm performs same operation as straightforward search, but compares hashes instead of chars. With ability to rotate hash (remove first, add next) characters there is also optimisation, that hash is calculated only for single next character rather again for whole substring. Unfortunatelly on my machine it is very cheap to compare characters so with my test hashing overweights character compare, except in your test when hash searching is faster then straightforward search.
This is best I can write in terms of performance and readability. I've tried with getFst that returns Maybe but it was slower so I decided to return '\0' in case that argument is empty list, which renders '\0' unusable, but then I really doubt that 0 will be used in strings.
-- Rabin-Karp string search algorithm, it is very effective in searching of set -- of patterns of length n on same string -- this program is for single pattern search, but can be crafted -- for multiple patterns of length m
I tuned it up somewhat: import Data.List (isPrefixOf) import Data.Char (ord) -- using ord instead of fromEnum oddly makes it -- faster for my test, but slower for yours, but only a whiff. searchrep :: String -> String -> String -> String searchrep "" _ str = str -- or cycle rp, or error? searchrep sr rp xs = hSearchRep xs -- don't carry more around than necessary where len = length sr -- we don't want that to be recomputed hsrch = hash sr -- neither that hSearchRep "" = "" hSearchRep xs | null remaining = passed | otherwise = passed ++ rp ++ hSearchRep (drop len remaining) where (passed,remaining) = hSearch xs -- ' xs (hash $ take len xs) "" hSearch xs = hSearch' xs hcmp "" -- since hSearch will be optimised where -- away anyway, we might hcmp = hash $ take len xs -- as well eliminate it ourselves hSearch' "" _ got = (reverse got, "") hSearch' xxs@(x:xs) hcd got | hcd == hsrch && (sr `isPrefixOf` xxs) = (reverse got, xxs) | otherwise = searchAgain -- one test less where searchAgain = case drop len xxs of [] -> (reverse got ++ xxs, "") -- then we know we're done (y:_) -> hSearch' xs (hashRotate x y hcd) (x:got) -- no need for fancy getFst anymore -- making hashRotate local eliminates one argument, makes it faster hashRotate :: Char -> Char -> Int -> Int hashRotate cout cin hsh = 101*(hsh - 101^(len-1)*ord cout) + ord cin -- using foldl for hash is an enormous boost hash :: String -> Int hash = foldl ((. ord) . (+) . (*101)) 0 -- hash str = foldl (\n c -> 101*n+ord c) 0 str -- this is equally fast as the point-free version, easier to read, probably, -- but I like an occasional pointless pointfreeness. Now this beats everything but KMP on my test very clearly. dafis@linux:~/Documents/haskell/Allotria/Search> time myhash; time myhash2 Working: seasearch replace able seaseaseasearch baker ssseasearch charlie True Done real 0m50.401s user 0m49.990s sys 0m0.060s Working very long True Done real 0m15.747s user 0m15.630s sys 0m0.020s Still poor on your test, though. Cheers, Daniel