Memoizing longest-common-subsequence

I wanted a longest common subsequence function and a bit of Googling failed to turn up a functional one, except for in a scary bit of darcs. So, I tried making a memoized functional version of the LCS delta algorithm on the problem's Wikipedia page. It's not the fastest, but it's simple and should be polynomial, growing with the product of input sequence lengths. I've not played with unboxing or strictness here yet, but with the help of Data.Array I have: longestCommonSubsequence xs ys = let lastIndex = (length xs, length ys) memo = array ((0, 0), lastIndex) [ ((xi, yi), deltaLCS (xi, x) (yi, y)) | (xi, x) <- zip [0..] (undefined : xs), (yi, y) <- zip [0..] (undefined : ys) ] deltaLCS (0, _) _ = (0, []) deltaLCS _ (0, _) = (0, []) deltaLCS (xl, x) (yl, y) = if (x == y) then let (xylShrunk, xysShrunk) = memo ! (xl-1, yl-1) in (xylShrunk + 1, x : xysShrunk) else let xaShrunk@(xlShrunk, _) = memo ! (xl-1, yl) yaShrunk@(ylShrunk, _) = memo ! (xl, yl-1) in if xlShrunk > ylShrunk then xaShrunk else yaShrunk in reverse (snd (memo ! lastIndex)) I haven't looked much at optimizing this further so there are probably yet some easy wins to be had - I'd be interested to see what they are. Still, I thought it worth sharing as an example of laziness making memoization easy in a simple functional longest common subsequence example. Take this as your cue to point out the much better LCS algorithm that already exists in the standard libraries, that I couldn't find. (-: -- Mark

Mark T.B. Carroll wrote:
Take this as your cue to point out the much better LCS algorithm that already exists in the standard libraries, that I couldn't find. (-:
I don't know of a version in the libraries, but since you mentioned you were unsuccessful looking for any functional algorithms solving this problem: I would be very surprised if none could be found in some paper of Richard Bird's. Also, something like the memoizing implementation you give should be easily in reach using Giegerich and Kurtz' "algebraic dynamic programming" approach that I recently mentioned here. Of course, it would then be less straightforward to apply subsequent optimizations (like strictifying and unboxing), as the details of the implementation would be hidden inside their DSL. Ciao, Janis. -- Dr. Janis Voigtlaender http://wwwtcs.inf.tu-dresden.de/~voigt/ mailto:voigt@tcs.inf.tu-dresden.de

Mark.Carroll@Aetion.com (Mark T.B. Carroll) writes:
I wanted a longest common subsequence function and a bit of Googling failed to turn up a functional one, except for in a scary bit of darcs.
The code in darcs is a translation of the example code in the Eugene Myers paper mentioned in the comments and the C code in GNU diff. I didn't try much to simplify it because I first wanted to see how close I could come to the perfomance of the C code. [..]
Take this as your cue to point out the much better LCS algorithm that already exists in the standard libraries, that I couldn't find. (-:
There is a nice implementation of the Hunt-Szymanski algorithm by Ian Lynagh at http://urchin.earth.li/darcs/ian/lcs/. Benedikt

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of
Mark.Carroll@Aetion.com (Mark T.B. Carroll) writes:
I wanted a longest common subsequence function and a bit of Googling failed to turn up a functional one, except for in a scary bit of darcs.
The code in darcs is a translation of the example code in the Eugene Myers paper mentioned in the comments and the C code in GNU diff. I didn't try much to simplify it because I first wanted to see how close I could come to the perfomance of the C code.
[..]
Take this as your cue to point out the much better LCS algorithm that already exists in the standard libraries, that I couldn't find. (-:
There is a nice implementation of the Hunt-Szymanski algorithm by Ian Lynagh at http://urchin.earth.li/darcs/ian/lcs/.
Benedikt
I also had a go at this a while ago, using this Eugene Myers' paper as the specification: http://www.cs.arizona.edu/~gene/PAPERS/np_diff.ps which claims improvement over the one you mention above: http://www.cs.arizona.edu/~gene/PAPERS/diff.ps and also ended up with an STArray based implementation. I can send the code if you're interested. I have no idea how well it performs compared to Ian's, or the one in darcs (which uses PackedStrings). There's some kind of summary/comparison of LCS algorithms (including Hunt-Szymanski) here: http://www-math.mit.edu/~lippert/18.417/papers/mye:1991.pdf also by Eugene Myers. Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

and also ended up with an STArray based implementation. I can send the code if you're interested. I have no idea how well it performs compared to Ian's, or the one in darcs (which uses PackedStrings).
Ian's LCS code looks like it works fine in terms of space usage, but for large input (lengrh as == 40000, length bs == 50000) it seems to be way too slow in terms of time complexity (GNU sdiff executes on this same data in a few seconds, the Hunt-Szymanski LCS algorithm didn't terminate, even after like 10 minutes). If it's not too much trouble, could you send Myers STArray based code? Thanks, Jared. -- http://www.updike.org/~jared/ reverse ")-:"

From: Jared Updike [mailto:jupdike@gmail.com]
Ian's LCS code looks like it works fine in terms of space usage, but for large input (lengrh as == 40000, length bs == 50000) it seems to be way too slow in terms of time complexity (GNU sdiff executes on this same data in a few seconds, the Hunt-Szymanski LCS algorithm didn't terminate, even after like 10 minutes).
If it's not too much trouble, could you send Myers STArray based code?
Thanks, Jared.
(Sorry for the late reply; have been on holiday.) Included below. I used to have a unit test module, and also a pure DiffArray-based implementation for performance comparison (it was sloooow), but I lost them when my laptop died (a harsh lesson in infrequent backups), so all I have left is the implementation module now. I've used it to diff fairly large files (hundreds of K's, if not Megs) where there were few differences. It seemed to perform OK, and in cases where GNU diff (or whatever comes with MSYS) failed. Alistair module Diff (diff, diffArray, diffSTArray, Modification(..)) where {- See: http://www.eecs.berkeley.edu/~gene/Papers/np_diff.pdf http://www.cs.arizona.edu/~gene/PAPERS/np_diff.ps (An O(NP) sequence comparison algorithm) The algorithm in the paper is: compare(src, dst) M := length src; N := length dst; delta := N - M; fp[-(M+1)..(N+1)] := -1; p := -1; repeat p++; for k := -p to delta-1 do fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1])); for k := delta+p downto delta+1 by -1 do fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1])); k := delta; fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1])); until fp[delta] = N print "edit distance " + (delta + 2*p) end; snake(k, y) x := y - k; while x < M and y < N and src[x+1] = B[y+1] do x := x + 1; y := y + 1; return y; end; Notes: 1. In the paper, y refers to the column position, and x is the row number. This is not conventional! i.e. x and y are swapped. In this implementation we use x and y in the conventional manner, and store the x value in the vertex array. 3. The algorithm as stated in the paper does not indicate how to determine when a move down or right occurs. I've assumed it's any update to a diagonal, where the new x or y value is within the dst or src bounds. 4. We need to set vtx[0] = 0. This handles the case where the destination string is empty. In this case, the first move should be a delete i.e. a move down. If vtx[0] = -1 then the nextEdit function would choose a move right to diagonal 0 (this would put us further along the diagonal than a move down), but we only want to move down, so this is the wrong move. Hence we set vtx[0] = 0. Note that the algorithm appears to work for all other cases (including empty src and non-empty dst) when vtx[0] is either 0 or -1. All other entries of vtx[] are initialised to -1. 5. The algorithm in the paper only handles the case where length dst >= length src i.e. where delta >= 0. How do we modify it to handle delta < 0? The rules for generating digaonals for delta >= 0 are: [-p up-to d-1] (below delta) [p+d downto d+1] (above delta) [d] And the rules for delta < 0: [p downto d+1] (above delta) [-d-p up-to d-1] (below delta) [d] These are derived from the "recurrence relation" which is: x[k,p] = k < delta | snake( max( x[k-1,p ] + 1, x[k+1,p-1] ) ) k > delta | snake( max( x[k-1,p ] + 1, x[k+1,p ] ) ) k = delta | snake( max( x[k-1,p-1] + 1, x[k+1,p ] ) ) Look at patterns for examining diagonals: delta = 0: 0 | p=0: 0 p=1: -1,1,0 p=2: -2,-1,2,1,0 p=3: -3,-2,-1,3,2,1,0 p=4: -4,... delta > 0: 1 | p=0: 0,1 p=1: -1,0,2,1 p=2: -2,-1,0,3,2,1 p=3: -3,-2,-1,0,4,3,2,1 p=3: -4,... 2 | p=0: 0,1,2 p=1: -1,0,1,3,2 p=2: -2,-1,0,1,4,3,2 p=3: -3,... First iteration goes from 0 to delta (outside in). Let's see what pattern is like for delta < 0. delta < 0 -1| p=0: 0,-1 p=1: 1,0,-2,-1 p=2: 2,1,0,-3,-2,-1 -2| p=0: 0,-1,-2 p=1: 1,0,-2,-1 p=2: 2,1,0,-3,-2,-1 -} import Data.Array.ST as STA import Data.Array import Control.Monad.ST data Modification a = Ins !Int !a | Del !Int !a deriving (Show, Eq) inBounds arr i = case STA.bounds arr of (l, u) -> i >= l && i <= u upperBound arr = case STA.bounds arr of (_, u) -> u isUpperBound arr i = i == upperBound arr -- list interface diff :: Eq a => [a] -> [a] -> [Modification a] diff [] [] = [] diff src dst = runST ( do srca <- makeArray1 src dsta <- makeArray1 dst diffSTArray srca dsta ) -- Array interface diffArray :: Eq a => Array Int a -> Array Int a -> [Modification a] diffArray src dst = runST ( do -- We can use unsafeThaw because we don't modify src or dst srca <- unsafeThaw src dsta <- unsafeThaw dst diffSTArray srca dsta ) -- STArray interface diffSTArray :: Eq a => STArray s Int a -> STArray s Int a -> ST s [Modification a] diffSTArray src dst = do let boundLower = 1 + upperBound src boundUpper = 1 + upperBound dst delta = boundUpper - boundLower vtx <- makeArray boundLower boundUpper (-1) writeArray vtx 0 0 edt <- makeArray boundLower boundUpper [] diffLoop src dst vtx edt 0 delta -- Need to help type-checker with array types makeArray :: Int -> Int -> a -> ST s (STArray s Int a) makeArray lower upper val = newArray (-lower, upper) val makeArray1 :: [a] -> ST s (STArray s Int a) makeArray1 str = newListArray (1, length str) str diffLoop src dst vtx edt p delta = do updateForP src dst vtx edt p delta fin <- finished dst vtx delta if fin then do l <- readArray edt delta return (reverse l) else diffLoop src dst vtx edt (p+1) delta finished dst vtx delta = do x <- readArray vtx delta return (isUpperBound dst x) makeDiagsIncr from to = [from..to] makeDiagsDecr from to = reverse [to..from] -- process diagonals within p-band in certain order: -- 1. those below diagonal delta, from outside in -- 2. those above diagonal delta, from outside in -- 3. diagonal delta updateForP src dst vtx edt p delta = if delta >= 0 then do forDiagonals src dst vtx edt (makeDiagsIncr (-p) (delta-1)) forDiagonals src dst vtx edt (makeDiagsDecr (p+delta) (delta+1)) forDiagonals src dst vtx edt [delta] else do forDiagonals src dst vtx edt (makeDiagsDecr p (delta+1)) forDiagonals src dst vtx edt (makeDiagsIncr (-delta-p) (delta-1)) forDiagonals src dst vtx edt [delta] forDiagonals src dst vtx edits [] = return () forDiagonals src dst vtx edits (d:diags) = do nextEdit src dst vtx edits d forDiagonals src dst vtx edits diags -- Move down is delete, move right is insert. -- Note that the position reported for Insert is relative to the -- src (y) axis, not the dst axis. -- This is because we expect to report the edit script as changes -- relative to the src sequence. -- type-sig is a big help with space usage - squashes unnecessary polymorphism. nextEdit :: Eq a => STArray s Int a -> STArray s Int a -> STArray s Int Int -> STArray s Int [Modification a] -> Int -> ST s () nextEdit src dst vtx edits diag = do mr <- readArray vtx (diag-1) moveDown <- readArray vtx (diag+1) let moveRight = 1 + mr x = if moveDown >= moveRight then moveDown else moveRight y = x - diag newX <- slideDownDiagonal src dst x diag writeArray vtx diag newX if moveDown >= moveRight then do editList <- makeEdit src y Del y edits (diag+1) writeArray edits diag editList else do editList <- makeEdit dst x Ins (y+1) edits (diag-1) writeArray edits diag editList makeEdit :: Eq a => STArray s Int a -> Int -> (Int -> a -> Modification a) -> Int -> STArray s Int [Modification a] -> Int -> ST s [Modification a] makeEdit str i cons y edits diag = do es <- readArray edits diag if inBounds str i then do e <- readArray str i return ((cons y e):es) else return es slideDownDiagonal :: Eq a => STArray s Int a -> STArray s Int a -> Int -> Int -> ST s Int slideDownDiagonal src dst x' dia = do let x = x' + 1; y = x - dia if (inBounds dst x) && (inBounds src y) then do a <- readArray dst x; b <- readArray src y if a == b then slideDownDiagonal src dst x dia else return x' else return x' ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

(Sorry for the late reply; have been on holiday.)
No problem. Your email system was kind enough to say when you'd be back :-)
I've used it to diff fairly large files (hundreds of K's, if not Megs) where there were few differences. It seemed to perform OK, and in cases where GNU diff (or whatever comes with MSYS) failed.
Thanks for posting the code. It works on pretty large data sets (for example, a thousand Strings each) and I have a hunch that if I use Data.ByteString it would even work fast enough on my quarter meg text files (split on words, ~40,000 and ~50,000 words each) to use in place of GNU sdiff or diff. Did you use FastPackedString or ByteString to get performance you alluded to? I'll return with the results of my experiments with ByteString and Diff, although I imagine it should be pretty fast since darcs is able to get acceptable speed on large datasets using (I think) ByeString and a Haskell implementation of Myers diff. Cheers, Jared. -- http://www.updike.org/~jared/ reverse ")-:"

From: Jared Updike [mailto:jupdike@gmail.com]
Thanks for posting the code. It works on pretty large data sets (for example, a thousand Strings each) and I have a hunch that if I use Data.ByteString it would even work fast enough on my quarter meg text files (split on words, ~40,000 and ~50,000 words each) to use in place of GNU sdiff or diff. Did you use FastPackedString or ByteString to get performance you alluded to?
You'll notice that the primary interface is: diff :: Eq a => [a] -> [a] -> [Modification a] So it's not optimised for any kind of String input. The performance I'm alluding to is not something I've tested with any rigor, or even metrics to support. I simply needed to compare a couple of large text files which GNU diff didn't handle (I think when the input is over a certain size it breaks it into chunks and diffs the chunks). I tried this code, and it did the job in a reasonable time (my PC has 1G of memory, and that helps when the input lists have to reside entirely in memory). That was using naïve String-IO too, so there might well be some mileage in a FPS-specific implementation. I profiled the code by generating large inputs in memory, and then invoking the various interfaces, so as to isolate the testing from IO performance. And that was just to find the performance hotspots in the algorithm, not to compare against other diff implementations. Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

From: Bayley, Alistair
The performance I'm alluding to is not something I've tested with any rigor, or even metrics to support. I simply needed to compare a couple of large text files which GNU diff didn't handle (I think when the input is over a certain size it breaks it into chunks and diffs the chunks).
Sorry, I'm telling complete porkies here... I've found the folder in which I did some of this testing, and GNU diff has no problem with the input files; they're only 7M, My program spends 70% of its time doing String-IO (so 30% in the algorithm), and peaks at about 350M of memory (which seems quite high, but then it does convert the String input into STArrays). The algorithm itself takes about 18secs on this input (wall-clock time), but the profile says a lot less CPU time is used. I think using ByteStrings would be a big improvement; maybe I'll find time to try that later. Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

Hello Alistair, Wednesday, August 23, 2006, 1:43:30 PM, you wrote:
I've found the folder in which I did some of this testing, and GNU diff has no problem with the input files; they're only 7M, My program spends 70% of its time doing String-IO (so 30% in the algorithm), and peaks at about 350M of memory (which seems quite high, but then it does convert the String input into STArrays). The algorithm itself takes about 18secs on this input (wall-clock time), but the profile says a lot less CPU time is used.
I think using ByteStrings would be a big improvement; maybe I'll find time to try that later.
as variant, you can try streams library - it's several times faster for line-oriented I/O. another solution that can also speed up your code, although not so much, is to use hGetContents and then 'lines' http://haskell.org/haskellwiki/Library/Streams http://www.haskell.org/library/StreamsBeta.tar.gz -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 8/1/06, Mark T.B. Carroll
I wanted a longest common subsequence function and a bit of Googling failed to turn up a functional one, except for in a scary bit of darcs.
I saw a thread from back in the day about this when I was looking for a good implementation of sdiff (or diff) in Haskell (thank Googleness for Google): http://www.haskell.org/pipermail/haskell/2002-November/010731.html all message in thread: http://www.haskell.org/pipermail/haskell/2002-November/thread.html#10726 Jared.
participants (6)
-
Bayley, Alistair
-
Benedikt Schmidt
-
Bulat Ziganshin
-
Janis Voigtlaender
-
Jared Updike
-
Mark.Carroll@Aetion.com