Re: [Haskell-cafe] Substring replacements

13 Dec 2005

      Am Montag, 12. Dezember 2005 16:28 schrieben Sie:
...
From: Daniel Fischer 
...
To: "Branimir Maksimovic" 
CC: Haskell-Cafe@haskell.org
Subject: Re: [Haskell-cafe] Substring replacements
Date: Mon, 12 Dec 2005 16:15:46 +0100
Earlier today:
...
Sorry, but
Prelude SearchRep> searchReplace "abaaba" "##" "abababaaba"
"abababaaba"
I haven't analyzed the algorithm, so I don't know why exactly this
fails.
...
I'll take a look sometime soon.
I found the problem (one at least).
Say the pattern to be replaced begins with 'a' and we have a sufficiently
long
match with the pattern starting at the first 'a' in the String. Upon
encountering the second 'a', while the first pattern still matches, you
start
pushing onto the rollback-stack. But that isn't inspected anymore, so if
the
actual occurence of the pattern starts at the third (or fourth, n-th)
occurence of 'a' and that is already pushed onto the rollback, you miss
it.
I've corrected this with adjusting rollback position. if rollBack is null
then
search for rollback starts at second character if not starts at same as
searhed
character because I skip what was searched. That's all.
Though I'm not so sure now when I read this.
Still not working:

*New> searchReplace "abababc" "#" "ababababababc"
"ababababababc"
*New> searchReplace1 "abababc" "#" "ababababababc"
"ababababababc"
...
...
So the question is, can we find a cheap test to decide whether to use KMP
or
Bulat's version?
In real world situation your KMP will always be fastest on average.
I like that we are not using C arrays as then we have advantage
of lazyness and save on memory usage. C++ program will be faster
on shorter strings but on this large strings will loose due memory
latency. and with your test, both programs are very fast.
Greetings, Bane.
On my 256MB RAM AMD Duron 1200 MHz, Bulat's version is consistently about 20% 
faster than my KMP on your test -- btw, I unboxed the pat array, which gave a 
bit of extra speed, but not much. 
And apologies to Sebastian Sylvan, I also included an unboxed version of bord, 
built from the boxed version, and that sped things further up -- not much, 
again, but there it is.
I wonder about this difference, -10% on one system and +20% on another system, 
ist that normal?

Cheers, Daniel
------------------------------------------------------------------------------------------
Up-To-Date version of KMP:

import Data.Array.Unboxed (UArray, listArray, (!))
import qualified Data.Array as A (array, (!), elems)

searchReplace :: String -> String -> String -> String
searchReplace "" _ str = str
searchReplace src@(c:cs) dst str = process 0 str ""
    where
      len = {-# scc "len" #-} length src
      pat :: UArray Int Char
      pat = {-# scc "pat" #-} listArray (0,len-1) src
      bord ={-# scc "bord" #-} A.array (0,len) $ (0,-1):(1,0):
                 [(i+1,getBord (pat!i) i + 1) | i <- [1 .. len-1]]
      getBord s n
         | m < 0      = m
	 | s == pat!m = m
	 | otherwise  = getBord s m
	   where
	     m = bord A.! n
      bor :: UArray Int Int
      bor = listArray (0,len) $ A.elems bord
      getBor s n
         | m < 0 || s == pat!m = m
	 | otherwise = getBor s m
	   where
	     m = bor!n
      process n str _ | n >= len = {-# scc "process" #-} dst ++ process 0 str 
""
      process _ "" mat = {-# scc "process" #-} reverse mat
      process 0 (s:st) _
         | s == c    = {-# scc "process" #-} process 1 st [s]
	 | otherwise = {-# scc "process" #-} s:process 0 st ""
      process n str@(s:st) mat
         | s == pat!n = {-# scc "process" #-} process (n+1) st (s:mat)
	 | otherwise  = {-# scc "process" #-}
	                let j = getBor s n
	                    (ret,skip) = splitAt j mat
			in if j < 0 then reverse mat ++ process 0 str ""
			   else reverse skip ++ process (j+1) st (s:ret)