Substring replacements

Daniel Fischer

12 Dec 2005 12 Dec '05

12:14 a.m.

Okay, I have looked up KMP and implemented it. Seems to work -- my first use of QuickCheck, too. It's slower than Bulat's and Tomasz' for Branimir's test :-(, but really fast for my test. Undoubtedly, one can still tune it. Here's the code: module KMP where import Data.Array searchReplace :: String -> String -> String -> String searchReplace "" _ str = str searchReplace src@(c:cs) dst str = process 0 str "" where len = length src pat = listArray (0,len-1) src bord = array (0,len) $ (0,-1):(1,0):[(i+1,boun i (bord!i)) | i <- [1 .. len-1]] boun i j | j < 0 = 0 | pat!i == pat!j = j+1 | otherwise = boun i (bord!j) getBord s n | m < 1 = m | s == pat!m = m | otherwise = getBord s m where m = bord!n process n str _ | n >= len = dst ++ process 0 str "" process _ "" mat = reverse mat process 0 (s:st) _ | s == c = process 1 st [s] | otherwise = s:process 0 st "" process n str@(s:st) mat | s == pat!n = process (n+1) st (s:mat) | otherwise = let j = getBord s n (ret,skip) = splitAt j mat in if j < 0 then reverse mat ++ process 0 str "" else reverse skip ++ process j str ret Cheers, Daniel

Show replies by date

Sebastian Sylvan

12 Dec 12 Dec

12:34 a.m.

On 12/12/05, Daniel Fischer wrote:

...

Okay, I have looked up KMP and implemented it. Seems to work -- my first use of QuickCheck, too. It's slower than Bulat's and Tomasz' for Branimir's test :-(, but really fast for my test. Undoubtedly, one can still tune it.

Perhaps by using unboxed arrays... /S -- Sebastian Sylvan +46(0)736-818655 UIN: 44640862

Daniel Fischer

9:31 a.m.

Am Montag, 12. Dezember 2005 01:34 schrieben Sie:

...

On 12/12/05, Daniel Fischer wrote:

...
Okay, I have looked up KMP and implemented it. Seems to work -- my first use of QuickCheck, too. It's slower than Bulat's and Tomasz' for Branimir's test :-(, but really fast for my test. Undoubtedly, one can still tune it.

Perhaps by using unboxed arrays...

/S

-- Sebastian Sylvan +46(0)736-818655 UIN: 44640862

I'm afraid, unboxed arrays are out of the question, because bord is incrementally produced :-( Working very long test2: <<loop>> Cheers, Daniel

Branimir Maksimovic

12:19 p.m.

...

From: Daniel Fischer To: sylvan@student.chalmers.se CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Mon, 12 Dec 2005 10:31:49 +0100

Am Montag, 12. Dezember 2005 01:34 schrieben Sie:

...
On 12/12/05, Daniel Fischer wrote:

...
Okay, I have looked up KMP and implemented it. Seems to work -- my first use of QuickCheck, too. It's slower than Bulat's and Tomasz' for Branimir's test :-(, but really fast for my test. Undoubtedly, one can still tune it.

Perhaps by using unboxed arrays...

/S

-- Sebastian Sylvan +46(0)736-818655 UIN: 44640862

I'm afraid, unboxed arrays are out of the question, because bord is incrementally produced :-(

Working very long test2: <<loop>>

No worrie your test is now fastest with both your and mine test. I;ve forgot to change working function in your test:0) mine test: your program is is srchrep.exe bmaxa@MAXA ~/tutorial $ time searchr.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done real 0m14.344s user 0m0.015s sys 0m0.015s bmaxa@MAXA ~/tutorial $ time srchrep.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done real 0m10.672s > your program is almost 1.5 secs faster then Bulat's user 0m0.015s sys 0m0.000s bmaxa@MAXA ~/tutorial $ time replace1.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done real 0m12.016s user 0m0.015s sys 0m0.015s now your test: bmaxa@MAXA ~/tutorial $ time searchr.exe Working very long True Done real 0m0.312s user 0m0.015s sys 0m0.015s bmaxa@MAXA ~/tutorial $ time replace1.exe Working very long False Done real 0m12.516s user 0m0.015s sys 0m0.015s bmaxa@MAXA ~/tutorial $ time srchrep.exe Working very long True Done real 0m0.375s > yours is less then second as mine but is fastest in both tests user 0m0.015s sys 0m0.015s I don;t know how you get lesser numbers with mine test, but on this machine your KMP algorithm performs best. Greetings ,Bane. _________________________________________________________________ Don't just search. Find. Check out the new MSN Search! http://search.msn.click-url.com/go/onm00200636ave/direct/01/

Branimir Maksimovic

4:20 a.m.

...

From: Daniel Fischer To: Haskell-Cafe@haskell.org Subject: [Haskell-cafe] Substring replacements Date: Mon, 12 Dec 2005 01:14:37 +0100

Okay, I have looked up KMP and implemented it. Seems to work -- my first use of QuickCheck, too. It's slower than Bulat's and Tomasz' for Branimir's test :-(, but really fast for my test.

Strange I got completelly different results: maxa@MAXA ~/tutorial $ time srchrep.exe Working very long True Done real 0m16.407s user 0m0.015s sys 0m0.015s bmaxa@MAXA ~/tutorial $ ghc -fglasgow-exts -O2 srchrep.hs --make -o srchrep.exe Chasing modules from: srchrep.hs Compiling Main ( srchrep.hs, srchrep.o ) Linking ... bmaxa@MAXA ~/tutorial $ time srchrep.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done real 0m10.156s user 0m0.015s sys 0m0.015s bmaxa@MAXA ~/tutorial $ time replace1.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done real 0m11.672s user 0m0.015s sys 0m0.015s Now your version is fastest according to my machine, but it is not faster with your test it's slower in compariton to replace1. I've corrected my code so it is fastest with your test,still less then a second, but slowest with mine. Checked with your fail tests and compared results of these 2 tests. Now should be ok. I maintan now two lists one for successes and other for failures. I also prettified code a bit . searchReplace :: String->String->String -> String searchReplace sr rp xs = searchr sr rp xs "" "" where searchr :: String->String->String->String->String -> String searchr [] _ xs _ _ = xs searchr _ _ [] _ _ = [] searchr sr rp xs retBack rollBack | isFound $ fnd rollBack = rp ++ searchr sr rp (remaining $ fnd rollBack ) ( getRetBack $ fnd rollBack) ( getRollBack $ fnd rollBack) | otherwise = reverse ((processed $ fnd rollBack) ++ rollBack) ++ searchr sr rp (remaining $ fnd rollBack) ( getRetBack $ fnd rollBack) ( getRollBack $ fnd rollBack) where fnd = searchr' sr sr (reverse retBack ++ xs) "" isFound = fst . fst remaining = snd . snd . fst getRollBack = snd . snd getRetBack = fst . snd processed = fst . snd . fst searchr' :: String->String->String->String->String -> ((Bool,(String,String)),(String,String)) searchr' srch@(sr:srs) srch'@(sr':srs') xs fndSoFar rollBack = searchr'' (drop (length rollBack) srch) srch' xs fndSoFar (False,"","") sr searchr'' :: String->String->String->String->(Bool,String,String)->Char -> ((Bool,(String,String)),(String,String)) searchr'' [] _ xs fnd _ _ = ((True,(fnd,xs)),("","")) searchr'' _ _ [] fnd (_,retBack,rollBack) _ = ((False,(retBack ++ rollBack ++ fnd,[])),("","")) searchr'' srch@(sr:srs) srch'@(sr':srs') xxs@(x:xs) fndSoFar (cnt,retBack,rollBack) s | sr == x = if cnt && sr' == x && isEmpty retBack then searchr'' srs srs' xs fndSoFar (True,"",x:rollBack) s else if not (isEmpty retBack) || not (isEmpty rollBack) then searchr'' srs srch' xs fndSoFar (True,(x:rollBack) ++ retBack,"") s else searchr'' srs srch' xs (x:fndSoFar) (True,"","") s | otherwise = if isEmpty rollBack && isEmpty retBack then if s == x then ((False,(fndSoFar,xxs)),("","")) else ((False,searchr''' s xs (x:fndSoFar)),("","")) else if sr' == x && isEmpty retBack then ((False,(fndSoFar, xs)), (retBack,x:rollBack)) else ((False,(fndSoFar, xxs)), (retBack,rollBack)) searchr''' :: Char->String->String -> (String,String) searchr''' sr [] fndSoFar = (fndSoFar,[]) searchr''' sr xxs@(x:xs) fndSoFar | sr/=x = searchr''' sr xs (x:fndSoFar) | otherwise = (fndSoFar,xxs) isEmpty [] = True isEmpty (a:as) = False ------------------------------------------------------------------------------- Greetings, Bane. _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Daniel Fischer

12:07 p.m.

Sorry, but Prelude SearchRep> searchReplace "abaaba" "##" "abababaaba" "abababaaba" I haven't analyzed the algorithm, so I don't know why exactly this fails. I'll take a look sometime soon. As for times: a complete stat is attached, all compiled with -O2, as well as the modified KMP-version and my transcript of Branimir's new version. On my computer, KMP smashes everything else on my test (except Branimir's, only that doesn't yet work correctly), while Bulat's definitely is faster than anything for Branimir's test and faster than anything but KMP for mine. Branimir, isEmpty is the Prelude function 'null', so you needn't define it yourself. Cheers, Daniel Am Montag, 12. Dezember 2005 05:20 schrieben Sie:

...

From: Daniel Fischer

...
To: Haskell-Cafe@haskell.org Subject: [Haskell-cafe] Substring replacements Date: Mon, 12 Dec 2005 01:14:37 +0100

Okay, I have looked up KMP and implemented it. Seems to work -- my first use of QuickCheck, too. It's slower than Bulat's and Tomasz' for Branimir's test :-(, but really fast for my test.

Strange I got completelly different results:

maxa@MAXA ~/tutorial $ time srchrep.exe Working very long True Done

real 0m16.407s user 0m0.015s sys 0m0.015s

bmaxa@MAXA ~/tutorial $ ghc -fglasgow-exts -O2 srchrep.hs --make -o srchrep.exe Chasing modules from: srchrep.hs Compiling Main ( srchrep.hs, srchrep.o ) Linking ...

bmaxa@MAXA ~/tutorial $ time srchrep.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done

real 0m10.156s user 0m0.015s sys 0m0.015s

bmaxa@MAXA ~/tutorial $ time replace1.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done

real 0m11.672s user 0m0.015s sys 0m0.015s

Now your version is fastest according to my machine, but it is not faster with your test it's slower in compariton to replace1.

I've corrected my code so it is fastest with your test,still less then a second, but slowest with mine. Checked with your fail tests and compared results of these 2 tests. Now should be ok. I maintan now two lists one for successes and other for failures. I also prettified code a bit .

searchReplace :: String->String->String -> String searchReplace sr rp xs = searchr sr rp xs "" "" where searchr :: String->String->String->String->String -> String searchr [] _ xs _ _ = xs searchr _ _ [] _ _ = [] searchr sr rp xs retBack rollBack

| isFound $ fnd rollBack = rp

++ searchr sr rp (remaining $ fnd rollBack ) ( getRetBack $ fnd rollBack) ( getRollBack $ fnd rollBack)

| otherwise = reverse ((processed $ fnd rollBack) ++

rollBack) ++ searchr sr rp (remaining $ fnd rollBack) ( getRetBack $ fnd rollBack) ( getRollBack $ fnd rollBack) where fnd = searchr' sr sr (reverse retBack ++ xs) ""

isFound = fst . fst remaining = snd . snd . fst getRollBack = snd . snd getRetBack = fst . snd processed = fst . snd . fst

searchr' :: String->String->String->String->String -> ((Bool,(String,String)),(String,String)) searchr' srch@(sr:srs) srch'@(sr':srs') xs fndSoFar rollBack = searchr'' (drop (length rollBack) srch) srch' xs fndSoFar (False,"","") sr

searchr'' :: String->String->String->String->(Bool,String,String)->Char -> ((Bool,(String,String)),(String,String)) searchr'' [] _ xs fnd _ _ = ((True,(fnd,xs)),("","")) searchr'' _ _ [] fnd (_,retBack,rollBack) _ = ((False,(retBack ++ rollBack ++ fnd,[])),("","")) searchr'' srch@(sr:srs) srch'@(sr':srs') xxs@(x:xs) fndSoFar (cnt,retBack,rollBack) s

| sr == x = if cnt && sr' == x && isEmpty retBack

then searchr'' srs srs' xs fndSoFar (True,"",x:rollBack) s else if not (isEmpty retBack) || not (isEmpty rollBack) then searchr'' srs srch' xs fndSoFar (True,(x:rollBack) ++ retBack,"") s else searchr'' srs srch' xs (x:fndSoFar) (True,"","") s

| otherwise = if isEmpty rollBack && isEmpty retBack

then if s == x then ((False,(fndSoFar,xxs)),("","")) else ((False,searchr''' s xs (x:fndSoFar)),("","")) else if sr' == x && isEmpty retBack then ((False,(fndSoFar, xs)), (retBack,x:rollBack)) else ((False,(fndSoFar, xxs)), (retBack,rollBack))

searchr''' :: Char->String->String -> (String,String) searchr''' sr [] fndSoFar = (fndSoFar,[]) searchr''' sr xxs@(x:xs) fndSoFar | sr/=x = searchr''' sr xs (x:fndSoFar)

| otherwise = (fndSoFar,xxs)

isEmpty [] = True isEmpty (a:as) = False --------------------------------------------------------------------------- ----

Greetings, Bane.

_________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Daniel Fischer

3:15 p.m.

Earlier today:

...

Sorry, but Prelude SearchRep> searchReplace "abaaba" "##" "abababaaba" "abababaaba"

I haven't analyzed the algorithm, so I don't know why exactly this fails. I'll take a look sometime soon.

I found the problem (one at least). Say the pattern to be replaced begins with 'a' and we have a sufficiently long match with the pattern starting at the first 'a' in the String. Upon encountering the second 'a', while the first pattern still matches, you start pushing onto the rollback-stack. But that isn't inspected anymore, so if the actual occurence of the pattern starts at the third (or fourth, n-th) occurence of 'a' and that is already pushed onto the rollback, you miss it. let src = concat (replicate n "abc") ++ "d" let str = concat (replicate (n+k) "abc") ++ "d" then searchReplace src "Success!" str will work correctly iff k is congruent to 0 or 1 modulo (n+1). Now to fix it, I see two possibilities 1. re-inspect the rollback, which brings you basically to my srchrep 2. introduce a hierarchy of rollback-stacks - but that would be rather horrible, I think, because you must keep count on which stack you have to push, how many matching patterns you currently have (and which ones they are) ... So the question is, can we find a cheap test to decide whether to use KMP or Bulat's version? Cheers, Daniel

Branimir Maksimovic

3:28 p.m.

...

From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Mon, 12 Dec 2005 16:15:46 +0100

Earlier today:

...
Sorry, but Prelude SearchRep> searchReplace "abaaba" "##" "abababaaba" "abababaaba"

I haven't analyzed the algorithm, so I don't know why exactly this fails. I'll take a look sometime soon.

I found the problem (one at least). Say the pattern to be replaced begins with 'a' and we have a sufficiently long match with the pattern starting at the first 'a' in the String. Upon encountering the second 'a', while the first pattern still matches, you start pushing onto the rollback-stack. But that isn't inspected anymore, so if the actual occurence of the pattern starts at the third (or fourth, n-th) occurence of 'a' and that is already pushed onto the rollback, you miss it.

I've corrected this with adjusting rollback position. if rollBack is null then search for rollback starts at second character if not starts at same as searhed character because I skip what was searched. That's all. Though I'm not so sure now when I read this.

...

So the question is, can we find a cheap test to decide whether to use KMP or Bulat's version?

In real world situation your KMP will always be fastest on average. I like that we are not using C arrays as then we have advantage of lazyness and save on memory usage. C++ program will be faster on shorter strings but on this large strings will loose due memory latency. and with your test, both programs are very fast. Greetings, Bane. _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Daniel Fischer

13 Dec 13 Dec

10:23 a.m.

Am Montag, 12. Dezember 2005 16:28 schrieben Sie:

...

From: Daniel Fischer

...
To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Mon, 12 Dec 2005 16:15:46 +0100

Earlier today:

...
Sorry, but Prelude SearchRep> searchReplace "abaaba" "##" "abababaaba" "abababaaba"

I haven't analyzed the algorithm, so I don't know why exactly this

fails.

...
I'll take a look sometime soon.

I found the problem (one at least). Say the pattern to be replaced begins with 'a' and we have a sufficiently long match with the pattern starting at the first 'a' in the String. Upon encountering the second 'a', while the first pattern still matches, you start pushing onto the rollback-stack. But that isn't inspected anymore, so if the actual occurence of the pattern starts at the third (or fourth, n-th) occurence of 'a' and that is already pushed onto the rollback, you miss it.

I've corrected this with adjusting rollback position. if rollBack is null then search for rollback starts at second character if not starts at same as searhed character because I skip what was searched. That's all. Though I'm not so sure now when I read this.

Still not working: *New> searchReplace "abababc" "#" "ababababababc" "ababababababc" *New> searchReplace1 "abababc" "#" "ababababababc" "ababababababc"

...

...
So the question is, can we find a cheap test to decide whether to use KMP or Bulat's version?

In real world situation your KMP will always be fastest on average. I like that we are not using C arrays as then we have advantage of lazyness and save on memory usage. C++ program will be faster on shorter strings but on this large strings will loose due memory latency. and with your test, both programs are very fast.

Greetings, Bane.

On my 256MB RAM AMD Duron 1200 MHz, Bulat's version is consistently about 20% faster than my KMP on your test -- btw, I unboxed the pat array, which gave a bit of extra speed, but not much. And apologies to Sebastian Sylvan, I also included an unboxed version of bord, built from the boxed version, and that sped things further up -- not much, again, but there it is. I wonder about this difference, -10% on one system and +20% on another system, ist that normal? Cheers, Daniel ------------------------------------------------------------------------------------------ Up-To-Date version of KMP: import Data.Array.Unboxed (UArray, listArray, (!)) import qualified Data.Array as A (array, (!), elems) searchReplace :: String -> String -> String -> String searchReplace "" _ str = str searchReplace src@(c:cs) dst str = process 0 str "" where len = {-# scc "len" #-} length src pat :: UArray Int Char pat = {-# scc "pat" #-} listArray (0,len-1) src bord ={-# scc "bord" #-} A.array (0,len) $ (0,-1):(1,0): [(i+1,getBord (pat!i) i + 1) | i <- [1 .. len-1]] getBord s n | m < 0 = m | s == pat!m = m | otherwise = getBord s m where m = bord A.! n bor :: UArray Int Int bor = listArray (0,len) $ A.elems bord getBor s n | m < 0 || s == pat!m = m | otherwise = getBor s m where m = bor!n process n str _ | n >= len = {-# scc "process" #-} dst ++ process 0 str "" process _ "" mat = {-# scc "process" #-} reverse mat process 0 (s:st) _ | s == c = {-# scc "process" #-} process 1 st [s] | otherwise = {-# scc "process" #-} s:process 0 st "" process n str@(s:st) mat | s == pat!n = {-# scc "process" #-} process (n+1) st (s:mat) | otherwise = {-# scc "process" #-} let j = getBor s n (ret,skip) = splitAt j mat in if j < 0 then reverse mat ++ process 0 str "" else reverse skip ++ process (j+1) st (s:ret)

Branimir Maksimovic

1:22 p.m.

...

From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Tue, 13 Dec 2005 11:23:29 +0100

Am Montag, 12. Dezember 2005 16:28 schrieben Sie:

...
From: Daniel Fischer

...
To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Mon, 12 Dec 2005 16:15:46 +0100

Earlier today:

...
Sorry, but Prelude SearchRep> searchReplace "abaaba" "##" "abababaaba" "abababaaba"

I haven't analyzed the algorithm, so I don't know why exactly this

fails.

...
I'll take a look sometime soon.

I found the problem (one at least). Say the pattern to be replaced begins with 'a' and we have a sufficiently long match with the pattern starting at the first 'a' in the String. Upon encountering the second 'a', while the first pattern still matches, you start pushing onto the rollback-stack. But that isn't inspected anymore, so if the actual occurence of the pattern starts at the third (or fourth, n-th) occurence of 'a' and that is already pushed onto the rollback, you miss it.

I've corrected this with adjusting rollback position. if rollBack is null then search for rollback starts at second character if not starts at same as searhed character because I skip what was searched. That's all. Though I'm not so sure now when I read this.

Still not working:

*New> searchReplace "abababc" "#" "ababababababc" "ababababababc" *New> searchReplace1 "abababc" "#" "ababababababc" "ababababababc"

Yes, perhaps you've missed another post of mine. I've noticed that problem when pattern repeats more then 2 times and gave up because now whatever I do, your version is always fastest.

...

...
...
So the question is, can we find a cheap test to decide whether to use KMP or Bulat's version?

Just interleave string with search hits with one with no seacrh (that means partial too) hits, and your version will gain in speed. More partial matches and full search matches Bulat's version will gain in speed. Longer search strings, your version will have gains.

...

...
In real world situation your KMP will always be fastest on average. I like that we are not using C arrays as then we have advantage of lazyness and save on memory usage. C++ program will be faster on shorter strings but on this large strings will loose due memory latency. and with your test, both programs are very fast.

Greetings, Bane.

On my 256MB RAM AMD Duron 1200 MHz, Bulat's version is consistently about 20% faster than my KMP on your test -- btw, I unboxed the pat array, which gave a bit of extra speed, but not much.

I think that's because on your machine Bulat's version have better perfromance with CPU cache. I don;t know but now your version is 25% faster with my test on P4 hyperthreaded. your new version: $ time srchrep.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done real 0m8.734s user 0m0.015s sys 0m0.000s Bulat's version: bmaxa@MAXA ~/tutorial $ time replace1.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done real 0m11.734s user 0m0.015s sys 0m0.015s 3 secs difference now.

...

And apologies to Sebastian Sylvan, I also included an unboxed version of bord, built from the boxed version, and that sped things further up -- not much, again, but there it is.

On my machine you got another 10-15% of boost with unboxed arrays.

...

I wonder about this difference, -10% on one system and +20% on another system, ist that normal?

Different caching schemes on CPU's perhaps? different memory latencies? hyperthreading helps your version? more code and data, perhaps because of that it pays the price on your machine? Greetings, Bane. _________________________________________________________________ Don't just search. Find. Check out the new MSN Search! http://search.msn.click-url.com/go/onm00200636ave/direct/01/

Daniel Fischer

14 Dec 14 Dec

4:10 p.m.

Hi, Bane and all, Am Dienstag, 13. Dezember 2005 14:22 schrieben Sie:

...

...
...
In real world situation your KMP will always be fastest on average. I like that we are not using C arrays as then we have advantage of lazyness and save on memory usage. C++ program will be faster on shorter strings but on this large strings will loose due memory latency. and with your test, both programs are very fast.

Yesterday, I did the unspeakable -- I wrote a C-version. Smashes Haskell-performance for short enough Strings (factor 10 for my test, factor 2.2 for Bane's), but once it starts swapping, we catch up, and for really large Strings I dare say we'd win far out. I also managed to get my KMP still faster, using take and drop instead of splitAt helps a lot (Bane, the use of 'break' in my transcript of yours was what slowed it down, I reintroduced searchr''' and both are equal). I'm not quite sure, whether that indeed helps, but I also chose to use listArray for the boxed array of borders. Now it's searchReplace :: String -> String -> String -> String searchReplace "" _ str = str searchReplace src@(c:cs) dst str = process 0 str "" where len = length src pat :: UArray Int Char pat = listArray (0,len-1) src bord = A.listArray (0,len) $ (-1):0: [getBord (pat!i) i + 1 | i <- [1 .. len-1]] getBord s n | m < 0 = m | s == pat!m = m | otherwise = getBord s m where m = bord A.! n bor :: UArray Int Int bor = listArray (0,len) $ A.elems bord getBor s n | m < 0 || s == pat!m = m | otherwise = getBor s m where m = bor!n process n str _ | n >= len = dst ++ process 0 str "" process _ "" mat = reverse mat process 0 (s:st) _ | s == c = process 1 st [s] | otherwise = s:process 0 st "" process n str@(s:st) mat | s == pat!n = process (n+1) st (s:mat) | otherwise = case getBor s n of -1 -> reverse mat ++ process 0 str "" 0 -> reverse mat ++ process 1 st [s] j -> reverse (drop j mat) ++ process (j+1) st (s:take j mat) gives a speedup of roughly 10% on my box versus yesterday's version.

...

...
...
Greetings, Bane.

On my 256MB RAM AMD Duron 1200 MHz, Bulat's version is consistently about 20% faster than my KMP on your test -- btw, I unboxed the pat array, which gave a bit of extra speed, but not much.

I think that's because on your machine Bulat's version have better perfromance with CPU cache. I don;t know but now your version is 25% faster with my test on P4 hyperthreaded.

Errrr, what's 'hyperthreaded' ? Unfortunately, I'm completely useless with computers.

...

your new version: $ time srchrep.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done

real 0m8.734s user 0m0.015s sys 0m0.000s

Bulat's version:

bmaxa@MAXA ~/tutorial $ time replace1.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done

real 0m11.734s user 0m0.015s sys 0m0.015s

3 secs difference now.

...
And apologies to Sebastian Sylvan, I also included an unboxed version of bord, built from the boxed version, and that sped things further up -- not much, again, but there it is.

On my machine you got another 10-15% of boost with unboxed arrays.

...
I wonder about this difference, -10% on one system and +20% on another system, ist that normal?

Different caching schemes on CPU's perhaps? different memory latencies? hyperthreading helps your version? more code and data, perhaps because of that it pays the price on your machine?

Greetings, Bane.

Well, whatever. Upto now, on my box, Bulat's is still the fastest for your test -- though I've narrowed the gap quite a bit. Cheers, Daniel

Branimir Maksimovic

15 Dec 15 Dec

12:22 a.m.

...

From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Wed, 14 Dec 2005 17:10:20 +0100

...

...
I think that's because on your machine Bulat's version have better perfromance with CPU cache. I don;t know but now your version is 25% faster with my test on P4 hyperthreaded.

Errrr, what's 'hyperthreaded' ? Unfortunately, I'm completely useless with computers.

I think that i've figure it now. Hyperthreading is hardware CPU feature that single CPU core can speed up execution of two running threads. For example if one thread uses integer unit and other FP unit CPU executes that in parallel. But that's not important or significant. What is interestenting is memory latency. If one thread peeks and pokes around memory for , say 1 unit of time, with usual CPU two thread will execute 2 units of time. Hyperthreaded (I'm talking about intel implementation) CPU will execute that in 1.4 points of time giving 60% boost in terms of speed. I've tested some assembler and C program that launches two threads each roaming over memory to anulate impact of cache. What is noticable is that two threads have 60% less memory latency constantly then single thread. That means if single thread for each out of cache memory access waits 300-400 CPU cycles, two threads wait 60% less. Now what has that to do with our programs as they are single threaded? I think it's garbage collection. Our programs run with garbage collector in background and you feel that burden by 20% as your program probably pushes garbage collector to work more than Bulat's version. On hyperthreaded CPU impact of garbage collection is reduced by a factor of 30-60 % resulting in your program being 30% faster on my machine. Greetings, Bane. _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Branimir Maksimovic

14 Dec 14 Dec

9:16 a.m.

...

From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Tue, 13 Dec 2005 11:23:29 +0100

After seeing that your program is fastest (I've also tried one from http://haskell.org/hawiki/RunTimeCompilation but perhaps I'm not that good in converting to search replace?) I've decided to try with Rabin-Karp algorithm. This algorithm performs same operation as straightforward search, but compares hashes instead of chars. With ability to rotate hash (remove first, add next) characters there is also optimisation, that hash is calculated only for single next character rather again for whole substring. Unfortunatelly on my machine it is very cheap to compare characters so with my test hashing overweights character compare, except in your test when hash searching is faster then straightforward search. This is best I can write in terms of performance and readability. I've tried with getFst that returns Maybe but it was slower so I decided to return '\0' in case that argument is empty list, which renders '\0' unusable, but then I really doubt that 0 will be used in strings. -- Rabin-Karp string search algorithm, it is very effective in searching of set -- of patterns of length n on same string -- this program is for single pattern search, but can be crafted -- for multiple patterns of length m hSearchReplace :: String -> String -> String -> String hSearchReplace sr rp xs | not (null remaining) = found ++ rp ++ hSearchReplace sr rp (drop (length sr) remaining) | otherwise = found where (found,remaining) = hSearch sr xs hSearch :: String -> String -> (String,String) hSearch sr xs = hSearch' sr xs hcmp "" where hsrch = hash sr hcmp = hash $ take ls xs cmp = take ls xs ls = length sr hSearch' [] xs _ _= (xs,[]) hSearch' sr [] _ fndFail = (reverse fndFail,[]) hSearch' srch xxs@(x:xs) hcmps fndFail = if hsrch == hcmps then if isPrefixOf srch xxs then (reverse fndFail,xxs) else searchAgain else searchAgain where searchAgain = hSearch' srch xs (hashRotate (getFst xxs) (getFst nextxxs) (ls-1) hcmps) (x:fndFail) nextxxs = drop ls xxs getFst :: String -> Char getFst [] = '\0' getFst (a:as) = a hash :: String -> Int hash str = hash' str (length str - 1) where hash' :: String -> Int -> Int hash' [] _ = 0 hash' (s:str) pow = (101 ^ pow) *(fromEnum s) + hash' str (pow-1) hashRotate :: Char -> Char -> Int -> Int -> Int hashRotate cout cin pow hsh = (hsh - ((101 ^ pow) * (fromEnum cout)))*101 + (fromEnum cin) Greetings, Bane. _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Daniel Fischer

7:40 p.m.

Hi Bane, nice algorithm. Since comparing chars _is_ cheap, it is to be expected that all the hash-rotating is far more costly for short search patterns. The longer the pattern, the better this gets, I think -- though nowhere near KMP (or would it?). However, I don't see how to (efficiently) do a multiple pattern search with KMP, so there -- if all patterns have the same length, otherwise I don't see -- Rabin-Karp would probably be the method of choice. Am Mittwoch, 14. Dezember 2005 10:16 schrieben Sie:

...

From: Daniel Fischer

...
To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Tue, 13 Dec 2005 11:23:29 +0100

After seeing that your program is fastest (I've also tried one from http://haskell.org/hawiki/RunTimeCompilation but perhaps I'm not that good in converting to search replace?) I've decided to try with Rabin-Karp algorithm. This algorithm performs same operation as straightforward search, but compares hashes instead of chars. With ability to rotate hash (remove first, add next) characters there is also optimisation, that hash is calculated only for single next character rather again for whole substring. Unfortunatelly on my machine it is very cheap to compare characters so with my test hashing overweights character compare, except in your test when hash searching is faster then straightforward search.

This is best I can write in terms of performance and readability. I've tried with getFst that returns Maybe but it was slower so I decided to return '\0' in case that argument is empty list, which renders '\0' unusable, but then I really doubt that 0 will be used in strings.

-- Rabin-Karp string search algorithm, it is very effective in searching of set -- of patterns of length n on same string -- this program is for single pattern search, but can be crafted -- for multiple patterns of length m

I tuned it up somewhat: import Data.List (isPrefixOf) import Data.Char (ord) -- using ord instead of fromEnum oddly makes it -- faster for my test, but slower for yours, but only a whiff. searchrep :: String -> String -> String -> String searchrep "" _ str = str -- or cycle rp, or error? searchrep sr rp xs = hSearchRep xs -- don't carry more around than necessary where len = length sr -- we don't want that to be recomputed hsrch = hash sr -- neither that hSearchRep "" = "" hSearchRep xs | null remaining = passed | otherwise = passed ++ rp ++ hSearchRep (drop len remaining) where (passed,remaining) = hSearch xs -- ' xs (hash $ take len xs) "" hSearch xs = hSearch' xs hcmp "" -- since hSearch will be optimised where -- away anyway, we might hcmp = hash $ take len xs -- as well eliminate it ourselves hSearch' "" _ got = (reverse got, "") hSearch' xxs@(x:xs) hcd got | hcd == hsrch && (sr `isPrefixOf` xxs) = (reverse got, xxs) | otherwise = searchAgain -- one test less where searchAgain = case drop len xxs of [] -> (reverse got ++ xxs, "") -- then we know we're done (y:_) -> hSearch' xs (hashRotate x y hcd) (x:got) -- no need for fancy getFst anymore -- making hashRotate local eliminates one argument, makes it faster hashRotate :: Char -> Char -> Int -> Int hashRotate cout cin hsh = 101*(hsh - 101^(len-1)*ord cout) + ord cin -- using foldl for hash is an enormous boost hash :: String -> Int hash = foldl ((. ord) . (+) . (*101)) 0 -- hash str = foldl (\n c -> 101*n+ord c) 0 str -- this is equally fast as the point-free version, easier to read, probably, -- but I like an occasional pointless pointfreeness. Now this beats everything but KMP on my test very clearly. dafis@linux:~/Documents/haskell/Allotria/Search> time myhash; time myhash2 Working: seasearch replace able seaseaseasearch baker ssseasearch charlie True Done real 0m50.401s user 0m49.990s sys 0m0.060s Working very long True Done real 0m15.747s user 0m15.630s sys 0m0.020s Still poor on your test, though. Cheers, Daniel

Branimir Maksimovic

15 Dec 15 Dec

12:55 a.m.

...

From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Wed, 14 Dec 2005 20:40:06 +0100

Hi Bane,

nice algorithm. Since comparing chars _is_ cheap, it is to be expected that all the hash-rotating is far more costly for short search patterns. The longer the pattern, the better this gets, I think -- though nowhere near KMP (or would it?).

Yes,KMP is superior in single pattern search. We didn't tried Boyer-Moore algorithm yet, though. But I think it would be difficult to implement it in Haskell efficiently as it searches backwards and jumps around, and we want memory savings. Though, I even didn't tried yet, but it is certainly very interesting. However, I don't see how to (efficiently) do a multiple

...

pattern search with KMP, so there -- if all patterns have the same length, otherwise I don't see -- Rabin-Karp would probably be the method of choice.

Yes, this algorithm can search in parallel patterns of same length. Different search patterns have to be searched same way as with KMP.

...

I tuned it up somewhat: import Data.List (isPrefixOf) import Data.Char (ord) -- using ord instead of fromEnum oddly makes it -- faster for my test, but slower for yours, but only a whiff.

Wow, on my machine your version of Rabin-Karp gives 30% boost to my test. This helps me learn Haskell , too . Greetings, Bane. _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Branimir Maksimovic

1:39 a.m.

...

From: "Branimir Maksimovic" To: daniel.is.fischer@web.de CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Thu, 15 Dec 2005 00:55:02 +0000

...
From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Wed, 14 Dec 2005 20:40:06 +0100

Hi Bane,

nice algorithm. Since comparing chars _is_ cheap, it is to be expected that all the hash-rotating is far more costly for short search patterns. The longer the pattern, the better this gets, I think -- though nowhere near KMP (or would it?).

Yes,KMP is superior in single pattern search. We didn't tried Boyer-Moore algorithm yet, though. But I think it would be difficult to implement it in Haskell efficiently as it searches backwards and jumps around, and we want memory savings. Though, I even didn't tried yet, but it is certainly very interesting.

Forget what I've said. Boyer-Moore *can* be implemented efficiently, it is similar to KMP it goes forward, but when it finds last character in pattern, than starts to search backwards. This can be implemented easilly as Haskell lists naturaly reverse order when putting from one list to other. Heh, never say never :) As I see from documents Boyer-Moore has best performance on average and should be better than KMP. Greetings,Bane. _________________________________________________________________ FREE pop-up blocking with the new MSN Toolbar - get it now! http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/

Branimir Maksimovic

6:21 p.m.

This is what I got for BM. Performance dissapoints as BM is really suited for indexed strings like arrays.It mainly operates on indexes. This is simple BM, as I didn't want to go for more complex variant,becauses takes and drops and recalculation of next position is too pricey for non indexed structure. So, clear winner is KMP for non indexed strings. There is also finite automaton algorithm but this works well if search strings are precompiled, so I'll implement it only for education purposes. I hope my Haskell improves as I've learned how to reduce number of paramaters. searchReplaceBM :: String -> String -> String -> String searchReplaceBM "" _ str = str searchReplaceBM sr rp str = searchReplace str where table :: UArray Int Int table = array (0,255) ([(i,0) | i <- [0..255]] ++ proc sr 1) proc [] _ = [] proc (s:st) i = (ord s,i):proc st (i+1) len = length sr rsrch = reverse sr searchReplace str | null remaining = if found then rp else passed |found = rp ++ searchReplace remaining | otherwise = passed ++ searchReplace remaining where (passed,remaining,found) = searchReplace' str searchReplace' str = if j == 0 then ("",drop len str,True) else failed where failed = case drop (j-1) str of [] -> (str,"",False) (c:_) -> (take sk str, drop sk str, False) where md = j - table ! ord c sk = if md > 0 then md else 1 j = srch rsrch (reverse $ take len str) len where srch "" "" _ = 0 srch _ "" l = l srch (s:str) (s':str') l | s == s' = srch str str' (l-1) | otherwise = l Greetings, Bane.

...

From: "Branimir Maksimovic" To: bmaxa@hotmail.com, daniel.is.fischer@web.de CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Thu, 15 Dec 2005 01:39:57 +0000

...
From: "Branimir Maksimovic" To: daniel.is.fischer@web.de CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Thu, 15 Dec 2005 00:55:02 +0000

...
From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Wed, 14 Dec 2005 20:40:06 +0100

Hi Bane,

nice algorithm. Since comparing chars _is_ cheap, it is to be expected that all the hash-rotating is far more costly for short search patterns. The longer the pattern, the better this gets, I think -- though nowhere near KMP (or would it?).

Yes,KMP is superior in single pattern search. We didn't tried Boyer-Moore algorithm yet, though. But I think it would be difficult to implement it in Haskell efficiently as it searches backwards and jumps around, and we want memory savings. Though, I even didn't tried yet, but it is certainly very interesting.

Forget what I've said. Boyer-Moore *can* be implemented efficiently, it is similar to KMP it goes forward, but when it finds last character in pattern, than starts to search backwards. This can be implemented easilly as Haskell lists naturaly reverse order when putting from one list to other. Heh, never say never :) As I see from documents Boyer-Moore has best performance on average and should be better than KMP.

Greetings,Bane.

_________________________________________________________________ FREE pop-up blocking with the new MSN Toolbar - get it now! http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/

_________________________________________________________________ Don't just search. Find. Check out the new MSN Search! http://search.msn.click-url.com/go/onm00200636ave/direct/01/

Daniel Fischer

8:07 p.m.

Am Donnerstag, 15. Dezember 2005 02:39 schrieben Sie:

...

From: "Branimir Maksimovic"

...
To: daniel.is.fischer@web.de CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Thu, 15 Dec 2005 00:55:02 +0000

...
From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Wed, 14 Dec 2005 20:40:06 +0100

Hi Bane,

nice algorithm. Since comparing chars _is_ cheap, it is to be expected that all the hash-rotating is far more costly for short search patterns. The longer the pattern, the better this gets, I think -- though nowhere near KMP (or would it?).

Yes,KMP is superior in single pattern search. We didn't tried Boyer-Moore algorithm yet, though. But I think it would be difficult to implement it in Haskell efficiently as it searches backwards and jumps around, and we want memory savings. Though, I even didn't tried yet, but it is certainly very interesting.

Forget what I've said. Boyer-Moore *can* be implemented efficiently, it is similar to KMP it goes forward, but when it finds last character in pattern, than starts to search backwards. This can be implemented easilly as Haskell lists naturaly reverse order when putting from one list to other. Heh, never say never :) As I see from documents Boyer-Moore has best performance on average and should be better than KMP.

Greetings,Bane.

Well, I also thought that all the jumping around in Boyer-Moore wasn't too good (after each shift we must bite off a chunk from the remaining input, pushing that onto the stack, which costs something). But I gave it a try today and here's what I came up with: import Data.List (tails) import Data.Map (Map) import qualified Data.Map as Map import Data.Array.Unboxed searchRep :: String -> String -> String -> String searchRep src rp str = run (reverse $ take len1 str) $ drop len1 str where len = length src len1 = len-1 pat :: UArray Int Char pat = listArray (0,len1) src ch = pat!len1 badChar :: Map Char Int badChar = Map.fromList $ zip src [0 .. ] getBc c = case Map.lookup c badChar of Just n -> n Nothing -> -1 suffs :: UArray Int Int suffs = listArray (0,len1) $! init $! map (pr 0 crs) $! tails crs where crs = reverse src pr n (x:xs) (y:ys) | x == y = pr (n+1) xs ys pr n _ _ = n bmGs0 :: UArray Int Int bmGs0 = array (0,len1) [(j,k) | (k,k') <- zip (tail $! help) help, j <- [k' .. k-1]] help = [k | k <- [0 .. len], k == len || suffs!k == len-k] bmGs :: UArray Int Int bmGs = bmGs0 // [(len1-suffs!k,k) | k <- [len1,len-2 .. 1]] run by "" = reverse by run by (c:cs) | c == ch = process (c:by) cs | otherwise = run (c:by) cs roll n xs ys | n <= 0 = (xs, ys) roll n xs (y:ys) = roll (n-1) (y:xs) ys roll _ xs "" = (xs, "") walk n "" = (n,"") walk n st@(c:cs) | n < 0 = (n,st) | c == pat!n = walk (n-1) cs | otherwise = (n,st) process con left | i < 0 = reverse pass ++ rp ++ run "" left | otherwise = {- bye ++ -} run ncon nleft where (i,pass) = walk len1 con d = if null pass then i+1 else max (bmGs!i) (i - getBc (head pass)) -- bye = reverse $! drop (len-d) con (ncon,nleft) = roll (d-1) {- (take (len-d) con) -} con left it's not as fast as KMP for the tests, but not too bad. Commenting out 'bye' gives a bit of extra speed, but if it's _long_ before a match (if any), we'd be better off relieving our memory with 'bye', I think. Any improvements are welcome, certainly some of you can do much better. Cheers, Daniel P.S. As an algorithm, I prefer KMP, it's quite elegant whereas BM is somewhat fussy.

Branimir Maksimovic

16 Dec 16 Dec

2:36 a.m.

...

From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Thu, 15 Dec 2005 21:07:11 +0100

Am Donnerstag, 15. Dezember 2005 02:39 schrieben Sie:

...
From: "Branimir Maksimovic"

...
To: daniel.is.fischer@web.de CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Thu, 15 Dec 2005 00:55:02 +0000

...
From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Wed, 14 Dec 2005 20:40:06 +0100

Hi Bane,

nice algorithm. Since comparing chars _is_ cheap, it is to be expected that all the hash-rotating is far more costly for short search patterns. The longer the pattern, the better this gets, I think -- though nowhere near KMP (or would it?).

Yes,KMP is superior in single pattern search. We didn't tried Boyer-Moore algorithm yet, though. But I think it would be difficult to implement it in Haskell efficiently as it searches backwards and jumps around, and we want memory savings. Though, I even didn't tried yet, but it is certainly very interesting.

Forget what I've said. Boyer-Moore *can* be implemented efficiently, it is similar to KMP it goes forward, but when it finds last character in pattern, than starts to search backwards. This can be implemented easilly as Haskell lists naturaly reverse order when putting from one list to other. Heh, never say never :) As I see from documents Boyer-Moore has best performance on average and should be better than KMP.

Greetings,Bane.

Well, I also thought that all the jumping around in Boyer-Moore wasn't too good (after each shift we must bite off a chunk from the remaining input, pushing that onto the stack, which costs something). But I gave it a try today and here's what I came up with:

import Data.List (tails) import Data.Map (Map) import qualified Data.Map as Map import Data.Array.Unboxed

searchRep :: String -> String -> String -> String searchRep src rp str = run (reverse $ take len1 str) $ drop len1 str where len = length src len1 = len-1 pat :: UArray Int Char pat = listArray (0,len1) src ch = pat!len1 badChar :: Map Char Int badChar = Map.fromList $ zip src [0 .. ] getBc c = case Map.lookup c badChar of Just n -> n Nothing -> -1 suffs :: UArray Int Int suffs = listArray (0,len1) $! init $! map (pr 0 crs) $! tails crs where crs = reverse src pr n (x:xs) (y:ys) | x == y = pr (n+1) xs ys pr n _ _ = n bmGs0 :: UArray Int Int bmGs0 = array (0,len1) [(j,k) | (k,k') <- zip (tail $! help) help, j <- [k' .. k-1]] help = [k | k <- [0 .. len], k == len || suffs!k == len-k] bmGs :: UArray Int Int bmGs = bmGs0 // [(len1-suffs!k,k) | k <- [len1,len-2 .. 1]] run by "" = reverse by run by (c:cs) | c == ch = process (c:by) cs | otherwise = run (c:by) cs roll n xs ys | n <= 0 = (xs, ys) roll n xs (y:ys) = roll (n-1) (y:xs) ys roll _ xs "" = (xs, "") walk n "" = (n,"") walk n st@(c:cs) | n < 0 = (n,st) | c == pat!n = walk (n-1) cs | otherwise = (n,st) process con left | i < 0 = reverse pass ++ rp ++ run "" left | otherwise = {- bye ++ -} run ncon nleft where (i,pass) = walk len1 con d = if null pass then i+1 else max (bmGs!i) (i - getBc (head pass)) -- bye = reverse $! drop (len-d) con (ncon,nleft) = roll (d-1) {- (take (len-d) con) -} con left

it's not as fast as KMP for the tests, but not too bad. Commenting out 'bye' gives a bit of extra speed, but if it's _long_ before a match (if any), we'd be better off relieving our memory with 'bye', I think.

Any improvements are welcome, certainly some of you can do much better.

It is fast on my machine except that you are using Map to lookup for badChar which is O(log n). I;ve placed this instead: badChar :: UArray Int Int badChar = array (0,255) ([(i,-1) | i <- [0..255]] ++ proc src 0) proc [] _ = [] proc (s:st) i = (ord s,i):proc st (i+1) getBc c = badChar ! ord c which gaved it significant boost, O(1) lookup. Now it's faster then brute force method but 10% slower then KMP with my test. I've also performed tests on dual Xeon linux box and results are proportionally the same as on my intel windows box. KMP wins again 10% better then BM and 20-30% better then straightforward search, which means that KMP is well suited for non indexed strings.

...

Cheers, Daniel

P.S. As an algorithm, I prefer KMP, it's quite elegant whereas BM is somewhat fussy.

Yes, BM is for indexed structures. Greetings, Bane. _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Daniel Fischer

9:22 a.m.

Am Freitag, 16. Dezember 2005 03:36 schrieben Sie:

...

From: Daniel Fischer

...
Any improvements are welcome, certainly some of you can do much better.

It is fast on my machine except that you are using Map to lookup for badChar which is O(log n). I;ve placed this instead: badChar :: UArray Int Int badChar = array (0,255) ([(i,-1) | i <- [0..255]] ++ proc src 0) proc [] _ = [] proc (s:st) i = (ord s,i):proc st (i+1) getBc c = badChar ! ord c

which gaved it significant boost, O(1) lookup.

Yes, but Char has 1114112 values, and I'm not sure whether such a large array would still be better, especially since, presumably, the Map will usually not be deeper than five layers, say. But if we restrict ourselves to extended ASCII Strings, an array certainly is better. And maybe, instead of using two arrays, bmGs0 and bmGs, a mutable array (those are DiffArrays, I think -- I'll check that out) would also improve it.

...

Now it's faster then brute force method but 10% slower then KMP with my test. I've also performed tests on dual Xeon linux box and results are proportionally the same as on my intel windows box. KMP wins again 10% better then BM and 20-30% better then straightforward search, which means that KMP is well suited for non indexed strings.

...
Cheers, Daniel

P.S. As an algorithm, I prefer KMP, it's quite elegant whereas BM is somewhat fussy.

Yes, BM is for indexed structures.

Greetings, Bane.

Cheers, Daniel

Bulat Ziganshin

1:51 p.m.

New subject: Re[2]: Substring replacements

Hello Branimir, Friday, December 16, 2005, 5:36:47 AM, you wrote: BM> I've also performed tests on dual Xeon linux box and results are just to let you know - GHC don't uses pentium4 hyperthreading, multiple cpus or multiple cores in these tests only way to make ghc using multiple processors is to use 6.5 beta version, compile with "-smp" and explicitly fork several threads -- Best regards, Bulat mailto:bulatz@HotPOP.com

Branimir Maksimovic

18 Dec 18 Dec

7:40 p.m.

New subject: Re[2]: Substring replacements

...

From: Bulat Ziganshin Reply-To: Bulat Ziganshin To: "Branimir Maksimovic" CC: daniel.is.fischer@web.de, Haskell-Cafe@haskell.org Subject: Re[2]: [Haskell-cafe] Substring replacements Date: Fri, 16 Dec 2005 16:51:32 +0300

Hello Branimir,

Friday, December 16, 2005, 5:36:47 AM, you wrote: BM> I've also performed tests on dual Xeon linux box and results are

just to let you know - GHC don't uses pentium4 hyperthreading, multiple cpus or multiple cores in these tests

only way to make ghc using multiple processors is to use 6.5 beta version, compile with "-smp" and explicitly fork several threads

You are right. I've double checked on linux there is just one thread executing and there is not such a big difference between KMP and straightforward search. Just about 10% KMP is faster with my test, but still faster. I've checked both SMP and non SMP linux (Intel). Hyperthreading effect is on windows only, I guess, as there are visible three threads per test process. I have one amd64 near (I'll check that one too, as soon as admin sets up account for me on that machine). Greetings, Bane. _________________________________________________________________ Don't just search. Find. Check out the new MSN Search! http://search.msn.click-url.com/go/onm00200636ave/direct/01/

ajb＠spamcop.net

15 Dec 15 Dec

5:25 a.m.

G'day all. Quoting Branimir Maksimovic :

...

After seeing that your program is fastest (I've also tried one from http://haskell.org/hawiki/RunTimeCompilation but perhaps I'm not that good in converting to search replace?)

You probably did it right, but you could post your version to the list if you want me to take a look. When I wrote the RunTimeCompilation code, it wasn't intended to be a shining example of efficiency, merely an illustration. Remember that it's doing TWO things: compiling the pattern to code, and then performing the search. The compilation phase is likely to be much slower than the search, so the speedup (if any!) would only be realised the SECOND time that you searched a string using the same pattern. (Assuming you re-used the compiled match code, of course!) Cheers, Andrew Bromage

Branimir Maksimovic

7:24 a.m.

...

From: ajb@spamcop.net To: haskell-cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Thu, 15 Dec 2005 00:25:19 -0500

G'day all.

Quoting Branimir Maksimovic :

...
After seeing that your program is fastest (I've also tried one from http://haskell.org/hawiki/RunTimeCompilation but perhaps I'm not that good in converting to search replace?)

You probably did it right, but you could post your version to the list if you want me to take a look.

Oh, here it is but just don;t laugh :) I've hacked with unsafePerformIO as I din't know how to remove IO from match any other way. searchReplaceKMP :: String->String->String -> String searchReplaceKMP sr rp s | not (null remaining) = found++rp ++ searchReplaceKMP sr rp remaining | otherwise = found where (found,remaining) = unsafePerformIO $ matchKMP sr s matchKMP :: (Monad m, Eq a) => [a] -> ([a] -> m ([a],[a])) matchKMP [] = error "Can't match empty list" matchKMP xs = matchfunc [] where matchfunc = makeMatchFunc [dofail] (zip xs (overlap xs)) dofail = \ps xs -> case xs of [] -> fail "can't match" (y:ys) -> matchfunc (y:ps) ys type PartialMatchFunc m a = [a] -> [a] -> m ([a], [a]) makeMatchFunc :: (Monad m, Eq a) => [PartialMatchFunc m a] -> [(a, Int)] -> PartialMatchFunc m a makeMatchFunc prev [] = \ps xs -> return (reverse (drop ((length prev)-1) ps), xs) makeMatchFunc prev ((x,failstate):ms) = thisf where mf = makeMatchFunc (thisf:prev) ms failcont = prev !! (length prev - failstate - 1) thisf = \ps xs -> case xs of [] -> fail "can't match" (y:ys) -> if (x == y) then mf (y:ps) ys else failcont ps xs overlap :: (Eq a) => [a] -> [Int] overlap str = overlap' [0] str where overlap' prev [] = reverse prev overlap' prev (x:xs) = let get_o o | o <= 1 || str !! (o-2) == x = o | otherwise = get_o (1 + prev !! (length prev - o + 1)) in overlap' (get_o (head prev + 1):prev) xs -------------------------------------------------------------------------------------- These are timings (it's performance is about the same as Rabin-Karp): $ time searchr.exe Working:seasearch replace able seaseasearch baker seasearch charlie searchr.exe: user error (can't match) real 0m22.187s user 0m0.015s sys 0m0.015s bmaxa@MAXA ~/tutorial $ ghc -fglasgow-exts -O2 searchr.hs --make -o searchr.exe Chasing modules from: searchr.hs Compiling Main ( searchr.hs, searchr.o ) Linking ... bmaxa@MAXA ~/tutorial $ time searchr.exe Working very long True Done real 0m8.110s user 0m0.031s sys 0m0.016s

...

When I wrote the RunTimeCompilation code, it wasn't intended to be a shining example of efficiency, merely an illustration. Remember that it's doing TWO things: compiling the pattern to code, and then performing the search. The compilation phase is likely to be much slower than the search, so the speedup (if any!) would only be realised the SECOND time that you searched a string using the same pattern. (Assuming you re-used the compiled match code, of course!)

Oh, that explaines it. Actually this has to be converted to searchReplace in order to be fast, but I don;t know how (yet) as your program is pretty complicated to my humble Haskell skills. I think that your technique can be usefull with Aho-Corasick algorithm as it first constructs finite automaton from tree, then performs search. So, I'll guess I'll try first Boyer-Moore, then Aho-Corasick, eventually run time compilation, but this is too advanced for me for now. Greetings, Bane. _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Branimir Maksimovic

12 Dec 12 Dec

4:56 p.m.

...

From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Mon, 12 Dec 2005 16:15:46 +0100

Earlier today:

...
Sorry, but Prelude SearchRep> searchReplace "abaaba" "##" "abababaaba" "abababaaba"

I haven't analyzed the algorithm, so I don't know why exactly this fails. I'll take a look sometime soon.

I found the problem (one at least). Say the pattern to be replaced begins with 'a' and we have a sufficiently long match with the pattern starting at the first 'a' in the String. Upon encountering the second 'a', while the first pattern still matches, you start pushing onto the rollback-stack. But that isn't inspected anymore, so if the actual occurence of the pattern starts at the third (or fourth, n-th) occurence of 'a' and that is already pushed onto the rollback, you miss it.

let src = concat (replicate n "abc") ++ "d" let str = concat (replicate (n+k) "abc") ++ "d" then searchReplace src "Success!" str will work correctly iff k is congruent to 0 or 1 modulo (n+1).

Oh, yes this seems the problem for searchr :( I have to look for efficient way in order to circumvent repeated searches. But since your KMP is fastest of all now, I am considering if there is any point now to correct this. And searchr ugly version that I've posted has a bug (not present in MyBane pretty version) . should be : else if sr'/=x Greetings, Bane. _________________________________________________________________ FREE pop-up blocking with the new MSN Toolbar - get it now! http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/

Branimir Maksimovic

3:19 p.m.

...

From: Daniel Fischer To: "Branimir Maksimovic" CC: Haskell-Cafe@haskell.org Subject: Re: [Haskell-cafe] Substring replacements Date: Mon, 12 Dec 2005 13:07:29 +0100

Sorry, but Prelude SearchRep> searchReplace "abaaba" "##" "abababaaba" "abababaaba"

I haven't analyzed the algorithm, so I don't know why exactly this fails. I'll take a look sometime soon.

It failed because I didn;t adjusted search string for rollBack when previous rollBack is not null. this is corrected version: (with your changes it looks much better) ------------------------------------------------------------------------------- searchReplace :: String->String->String -> String searchReplace "" _ xs = xs searchReplace sr rp xs = searchr sr rp xs "" "" where searchr :: String->String->String->String->String -> String searchr _ _ "" _ _ = "" searchr sr rp xs retB rollB | found = rp ++ searchr sr rp rema ret roll | otherwise = reverse (proc ++ rollB) ++ searchr sr rp rema ret roll where (found, proc, rema, ret, roll) = searchr' sr sr (reverse retB ++ xs) "" rollB searchr' src@(s:sr) src'@(s':sr') xs soFar rollB = searchr'' (drop (length rollB) src) src' xs soFar (not (null rollB),"","") s searchr'' "" _ xs fnd _ _ = (True,fnd,xs,"","") searchr'' _ _ "" fnd (_,ret,roll) _ = (False,ret++roll++fnd,"","","") searchr'' src@(s:sr) src'@(s':sr') xxs@(x:xs) soFar (cnt,ret,roll) c | s == x = if s' == x && null ret && cnt then searchr'' sr sr' xs soFar (True, "", x:roll) c else if null ret && null roll then searchr'' sr src' xs (x:soFar) (True, "", "") c else searchr'' sr src' xs soFar (True, x:roll++ret, "") c | otherwise = if null roll && null ret then if c == x then (False, soFar, xxs, "", "") else let (from, pre) = break (==c) xs in (False, reverse from ++ x:soFar, pre, "", "") else if s'/=x then if null ret then (False, (x:roll) ++ soFar, xs,"","") else (False, soFar, xxs,ret,"") else if null ret then (False, soFar, xs, "", x:roll) else (False, soFar, xxs, ret, "") ---------------------------------------------------------------------------- However it is significantly slower then previous ugly version: searchReplace :: String->String->String -> String searchReplace sr rp xs = searchr sr rp xs "" "" where searchr :: String->String->String->String->String -> String searchr [] _ xs _ _ = xs searchr _ _ [] _ _ = [] searchr sr rp xs retBack rollBack | isFound $ fnd rollBack = rp ++ searchr sr rp (remaining $ fnd rollBack ) ( getRetBack $ fnd rollBack) ( getRollBack $ fnd rollBack) | otherwise = reverse ((processed $ fnd rollBack) ++ rollBack) ++ searchr sr rp (remaining $ fnd rollBack) ( getRetBack $ fnd rollBack) ( getRollBack $ fnd rollBack) where fnd = searchr' sr sr (reverse retBack ++ xs) "" isFound = fst . fst remaining = snd . snd . fst getRollBack = snd . snd getRetBack = fst . snd processed = fst . snd . fst searchr' :: String->String->String->String->String -> ((Bool,(String,String)),(String,String)) searchr' srch@(sr:srs) srch'@(sr':srs') xs fndSoFar rollBack = searchr'' (drop (length rollBack) srch) srch' xs fndSoFar (not (isEmpty rollBack),"","") sr searchr'' :: String->String->String->String->(Bool,String,String)->Char -> ((Bool,(String,String)),(String,String)) searchr'' [] _ xs fnd _ _ = ((True,(fnd,xs)),("","")) searchr'' _ _ [] fnd (_,retBack,rollBack) _ = ((False,(retBack ++ rollBack ++ fnd,[])),("","")) searchr'' srch@(sr:srs) srch'@(sr':srs') xxs@(x:xs) fndSoFar (cnt,retBack,rollBack) s | sr == x = if cnt && sr' == x && isEmpty retBack then searchr'' srs srs' xs fndSoFar (True,"",x:rollBack) s else if not (isEmpty retBack) || not (isEmpty rollBack) then searchr'' srs srch' xs fndSoFar (True,(x:rollBack) ++ retBack,"") s else searchr'' srs srch' xs (x:fndSoFar) (True,"","") s | otherwise = if isEmpty rollBack && isEmpty retBack then if s == x then ((False,(fndSoFar,xxs)),("","")) else ((False,searchr''' s xs (x:fndSoFar)),("","")) else if sr/=x then if isEmpty retBack then ((False,((x:rollBack)++fndSoFar, xs)), ("","")) else ((False,(fndSoFar, xxs)), (retBack,"")) else if isEmpty retBack then ((False,(fndSoFar, xs)), ("",x:rollBack)) else ((False,(fndSoFar, xxs)), (retBack,"")) searchr''' :: Char->String->String -> (String,String) searchr''' sr [] fndSoFar = (fndSoFar,[]) searchr''' sr xxs@(x:xs) fndSoFar | sr/=x = searchr''' sr xs (x:fndSoFar) | otherwise = (fndSoFar,xxs) isEmpty [] = True isEmpty (a:as) = False ------------------------------------------------------------------------------- these are timings: $ time MyBane.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done real 0m19.984s user 0m0.015s sys 0m0.016s bmaxa@MAXA ~/tutorial $ time searchr.exe Working:seasearch replace able seaseasearch baker seasearch charlie True Done real 0m13.719s user 0m0.015s sys 0m0.000s It's 6 seconds difference. Your KMP is always fastest of all version in any combination on 2gb p4 3ghz hyperthreaded windows. Same results on linux will be I guess with same machine as I;ve previously tested on linux but some other p4 and results were proportionally the same. Greetings, Bane. _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

7142

Age (days ago)

7148

Last active (days ago)

List overview

Download

25 comments

5 participants

participants (5)

ajb＠spamcop.net
Branimir Maksimovic
Bulat Ziganshin
Daniel Fischer
Sebastian Sylvan