
From: Jared Updike [mailto:jupdike@gmail.com]
Ian's LCS code looks like it works fine in terms of space usage, but for large input (lengrh as == 40000, length bs == 50000) it seems to be way too slow in terms of time complexity (GNU sdiff executes on this same data in a few seconds, the Hunt-Szymanski LCS algorithm didn't terminate, even after like 10 minutes).
If it's not too much trouble, could you send Myers STArray based code?
Thanks, Jared.
(Sorry for the late reply; have been on holiday.) Included below. I used to have a unit test module, and also a pure DiffArray-based implementation for performance comparison (it was sloooow), but I lost them when my laptop died (a harsh lesson in infrequent backups), so all I have left is the implementation module now. I've used it to diff fairly large files (hundreds of K's, if not Megs) where there were few differences. It seemed to perform OK, and in cases where GNU diff (or whatever comes with MSYS) failed. Alistair module Diff (diff, diffArray, diffSTArray, Modification(..)) where {- See: http://www.eecs.berkeley.edu/~gene/Papers/np_diff.pdf http://www.cs.arizona.edu/~gene/PAPERS/np_diff.ps (An O(NP) sequence comparison algorithm) The algorithm in the paper is: compare(src, dst) M := length src; N := length dst; delta := N - M; fp[-(M+1)..(N+1)] := -1; p := -1; repeat p++; for k := -p to delta-1 do fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1])); for k := delta+p downto delta+1 by -1 do fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1])); k := delta; fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1])); until fp[delta] = N print "edit distance " + (delta + 2*p) end; snake(k, y) x := y - k; while x < M and y < N and src[x+1] = B[y+1] do x := x + 1; y := y + 1; return y; end; Notes: 1. In the paper, y refers to the column position, and x is the row number. This is not conventional! i.e. x and y are swapped. In this implementation we use x and y in the conventional manner, and store the x value in the vertex array. 3. The algorithm as stated in the paper does not indicate how to determine when a move down or right occurs. I've assumed it's any update to a diagonal, where the new x or y value is within the dst or src bounds. 4. We need to set vtx[0] = 0. This handles the case where the destination string is empty. In this case, the first move should be a delete i.e. a move down. If vtx[0] = -1 then the nextEdit function would choose a move right to diagonal 0 (this would put us further along the diagonal than a move down), but we only want to move down, so this is the wrong move. Hence we set vtx[0] = 0. Note that the algorithm appears to work for all other cases (including empty src and non-empty dst) when vtx[0] is either 0 or -1. All other entries of vtx[] are initialised to -1. 5. The algorithm in the paper only handles the case where length dst >= length src i.e. where delta >= 0. How do we modify it to handle delta < 0? The rules for generating digaonals for delta >= 0 are: [-p up-to d-1] (below delta) [p+d downto d+1] (above delta) [d] And the rules for delta < 0: [p downto d+1] (above delta) [-d-p up-to d-1] (below delta) [d] These are derived from the "recurrence relation" which is: x[k,p] = k < delta | snake( max( x[k-1,p ] + 1, x[k+1,p-1] ) ) k > delta | snake( max( x[k-1,p ] + 1, x[k+1,p ] ) ) k = delta | snake( max( x[k-1,p-1] + 1, x[k+1,p ] ) ) Look at patterns for examining diagonals: delta = 0: 0 | p=0: 0 p=1: -1,1,0 p=2: -2,-1,2,1,0 p=3: -3,-2,-1,3,2,1,0 p=4: -4,... delta > 0: 1 | p=0: 0,1 p=1: -1,0,2,1 p=2: -2,-1,0,3,2,1 p=3: -3,-2,-1,0,4,3,2,1 p=3: -4,... 2 | p=0: 0,1,2 p=1: -1,0,1,3,2 p=2: -2,-1,0,1,4,3,2 p=3: -3,... First iteration goes from 0 to delta (outside in). Let's see what pattern is like for delta < 0. delta < 0 -1| p=0: 0,-1 p=1: 1,0,-2,-1 p=2: 2,1,0,-3,-2,-1 -2| p=0: 0,-1,-2 p=1: 1,0,-2,-1 p=2: 2,1,0,-3,-2,-1 -} import Data.Array.ST as STA import Data.Array import Control.Monad.ST data Modification a = Ins !Int !a | Del !Int !a deriving (Show, Eq) inBounds arr i = case STA.bounds arr of (l, u) -> i >= l && i <= u upperBound arr = case STA.bounds arr of (_, u) -> u isUpperBound arr i = i == upperBound arr -- list interface diff :: Eq a => [a] -> [a] -> [Modification a] diff [] [] = [] diff src dst = runST ( do srca <- makeArray1 src dsta <- makeArray1 dst diffSTArray srca dsta ) -- Array interface diffArray :: Eq a => Array Int a -> Array Int a -> [Modification a] diffArray src dst = runST ( do -- We can use unsafeThaw because we don't modify src or dst srca <- unsafeThaw src dsta <- unsafeThaw dst diffSTArray srca dsta ) -- STArray interface diffSTArray :: Eq a => STArray s Int a -> STArray s Int a -> ST s [Modification a] diffSTArray src dst = do let boundLower = 1 + upperBound src boundUpper = 1 + upperBound dst delta = boundUpper - boundLower vtx <- makeArray boundLower boundUpper (-1) writeArray vtx 0 0 edt <- makeArray boundLower boundUpper [] diffLoop src dst vtx edt 0 delta -- Need to help type-checker with array types makeArray :: Int -> Int -> a -> ST s (STArray s Int a) makeArray lower upper val = newArray (-lower, upper) val makeArray1 :: [a] -> ST s (STArray s Int a) makeArray1 str = newListArray (1, length str) str diffLoop src dst vtx edt p delta = do updateForP src dst vtx edt p delta fin <- finished dst vtx delta if fin then do l <- readArray edt delta return (reverse l) else diffLoop src dst vtx edt (p+1) delta finished dst vtx delta = do x <- readArray vtx delta return (isUpperBound dst x) makeDiagsIncr from to = [from..to] makeDiagsDecr from to = reverse [to..from] -- process diagonals within p-band in certain order: -- 1. those below diagonal delta, from outside in -- 2. those above diagonal delta, from outside in -- 3. diagonal delta updateForP src dst vtx edt p delta = if delta >= 0 then do forDiagonals src dst vtx edt (makeDiagsIncr (-p) (delta-1)) forDiagonals src dst vtx edt (makeDiagsDecr (p+delta) (delta+1)) forDiagonals src dst vtx edt [delta] else do forDiagonals src dst vtx edt (makeDiagsDecr p (delta+1)) forDiagonals src dst vtx edt (makeDiagsIncr (-delta-p) (delta-1)) forDiagonals src dst vtx edt [delta] forDiagonals src dst vtx edits [] = return () forDiagonals src dst vtx edits (d:diags) = do nextEdit src dst vtx edits d forDiagonals src dst vtx edits diags -- Move down is delete, move right is insert. -- Note that the position reported for Insert is relative to the -- src (y) axis, not the dst axis. -- This is because we expect to report the edit script as changes -- relative to the src sequence. -- type-sig is a big help with space usage - squashes unnecessary polymorphism. nextEdit :: Eq a => STArray s Int a -> STArray s Int a -> STArray s Int Int -> STArray s Int [Modification a] -> Int -> ST s () nextEdit src dst vtx edits diag = do mr <- readArray vtx (diag-1) moveDown <- readArray vtx (diag+1) let moveRight = 1 + mr x = if moveDown >= moveRight then moveDown else moveRight y = x - diag newX <- slideDownDiagonal src dst x diag writeArray vtx diag newX if moveDown >= moveRight then do editList <- makeEdit src y Del y edits (diag+1) writeArray edits diag editList else do editList <- makeEdit dst x Ins (y+1) edits (diag-1) writeArray edits diag editList makeEdit :: Eq a => STArray s Int a -> Int -> (Int -> a -> Modification a) -> Int -> STArray s Int [Modification a] -> Int -> ST s [Modification a] makeEdit str i cons y edits diag = do es <- readArray edits diag if inBounds str i then do e <- readArray str i return ((cons y e):es) else return es slideDownDiagonal :: Eq a => STArray s Int a -> STArray s Int a -> Int -> Int -> ST s Int slideDownDiagonal src dst x' dia = do let x = x' + 1; y = x - dia if (inBounds dst x) && (inBounds src y) then do a <- readArray dst x; b <- readArray src y if a == b then slideDownDiagonal src dst x dia else return x' else return x' ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************