Re: [Haskell-cafe] Substring replacements

15 Dec 2005

      Hi Bane,

nice algorithm. Since comparing chars _is_ cheap, it is to be expected that 
all the hash-rotating is far more costly for short search patterns. The 
longer the pattern, the better this gets, I think -- though nowhere near KMP 
(or would it?). However, I don't see how to (efficiently) do a multiple 
pattern search with KMP, so there -- if all patterns have the same length, 
otherwise I don't see -- Rabin-Karp would probably be the method of choice.

Am Mittwoch, 14. Dezember 2005 10:16 schrieben Sie:
...
From: Daniel Fischer 
...
To: "Branimir Maksimovic" 
CC: Haskell-Cafe@haskell.org
Subject: Re: [Haskell-cafe] Substring replacements
Date: Tue, 13 Dec 2005 11:23:29 +0100
After seeing that your program is fastest (I've also tried one from
http://haskell.org/hawiki/RunTimeCompilation but perhaps I'm not
that good in converting to search replace?) I've decided to
try with Rabin-Karp algorithm.
This algorithm performs same operation as straightforward search,
but compares hashes instead of chars.
With ability to rotate hash (remove first, add next) characters
there is also optimisation, that hash is calculated only for single
next character rather again for whole substring.
Unfortunatelly on my machine it is very cheap to compare
characters so with my test hashing overweights character compare,
except in your test when hash searching is faster then straightforward
search.
This is best I can write in terms of performance and readability.
I've tried with getFst that returns Maybe but it was slower so I decided
to return '\0' in case that argument is empty list, which renders '\0'
unusable, but then I really doubt that 0 will be used in strings.
-- Rabin-Karp string search algorithm, it is very effective in searching of
set
-- of patterns of length n on same string
-- this program is for single pattern search, but can be crafted
-- for multiple patterns of length m
I tuned it up somewhat:
import Data.List (isPrefixOf)
import Data.Char (ord)  -- using ord instead of fromEnum oddly makes it
-- faster for my test, but slower for yours, but only a whiff.

searchrep :: String -> String -> String -> String
searchrep "" _ str = str    -- or cycle rp, or error?
searchrep sr rp xs = hSearchRep xs  -- don't carry more around than necessary
    where
       len = length sr       -- we don't want that to be recomputed 
       hsrch = hash sr     -- neither that
       hSearchRep  "" = ""
       hSearchRep xs
           | null remaining = passed
	   | otherwise      = passed ++ rp ++ hSearchRep (drop len remaining)
	     where
	       (passed,remaining) = hSearch xs  -- ' xs (hash $ take len xs) ""
       hSearch xs = hSearch' xs hcmp "" -- since hSearch will be optimised
           where                                     -- away anyway, we might
	      hcmp = hash $ take len xs   -- as well eliminate it ourselves
       hSearch' "" _ got = (reverse got, "")
       hSearch' xxs@(x:xs) hcd got
           | hcd == hsrch && (sr `isPrefixOf` xxs) = (reverse got, xxs)
	   | otherwise                             = searchAgain -- one test less
	     where
	       searchAgain = case drop len xxs of
	        []     -> (reverse got ++ xxs, "")   -- then we know we're done
		(y:_)  -> hSearch' xs (hashRotate x y hcd) (x:got)
-- no need for fancy getFst anymore
-- making hashRotate local eliminates one argument, makes it faster
       hashRotate :: Char -> Char -> Int -> Int
       hashRotate cout cin hsh = 101*(hsh - 101^(len-1)*ord cout) + ord cin
-- using foldl for hash is an enormous boost
       hash :: String -> Int
       hash = foldl ((. ord) . (+) . (*101)) 0
-- hash str = foldl (\n c -> 101*n+ord c) 0 str
-- this is equally fast as the point-free version, easier to read, probably,
-- but I like an occasional pointless pointfreeness.

Now this beats everything but KMP on my test very clearly.
dafis@linux:~/Documents/haskell/Allotria/Search> time myhash; time myhash2
Working: seasearch replace  able seaseaseasearch baker ssseasearch charlie
True
Done

real    0m50.401s
user    0m49.990s
sys     0m0.060s
Working very long
True
Done

real    0m15.747s
user    0m15.630s
sys     0m0.020s

Still poor on your test, though.

Cheers,
Daniel