[darcs #222] performance problem with massive changes in a single file

I'm cc'ing haskell-cafe on this darcs bug, since I imagine there might be a student (or professor) hanging around there who might be interested in the implementation of an efficient LCS algorithm in haskell. [markjugg - Sun Feb 20 09:15:57 2005]:
Typically in software, 1000's of line aren't changed at the time, which is why this hasn't been an big issue before.
Indeed. The issue here is that darcs uses the exact Hunt-Szymanski LCS algorithm when it computes the diff, while GNU diff uses an approximate --but faster--algorithm, which works "almost" as well. The approximate algorithm is reasonably well-documented, but more complex to implement than the exact one, just because it's harder to understand. This would be a reasonable project for an interested haskell CS student. The LCS is implemented in pure haskell 98, and is of signature lcs :: Ord a => [a] -> [a] -> [a] It's an interesting problem that's easily testable, and has different scaling behavior in different limits (finite alphabet, infinite alphabet, permutations of unique objects). David (ever optimistic for new contributors to darcs!) http://darcs.net

G'day all.
Quoting David Roundy via RT
Indeed. The issue here is that darcs uses the exact Hunt-Szymanski LCS algorithm when it computes the diff, while GNU diff uses an approximate --but faster--algorithm, which works "almost" as well. The approximate algorithm is reasonably well-documented, but more complex to implement than the exact one, just because it's harder to understand.
This isn't entirely true. The Myers algorithm used in GNU diff is exact. However, it's also designed to adapt (if allowed) in the case where it seems to be taking too long. GNU diff also goes to some trouble to help the Myers algorithm along. A common case in real-life diffing is that you have a "block" of adjacent lines in one file, none of which appear in the other file. In that case, the block can be combined into one "virtual line", which reduces the effective length of the diff, making most LCS algorithms go faster. A problem with the Myers algorithm is that the diff that it produces is less "pretty" than that produced by most other means.
This would be a reasonable project for an interested haskell CS student. The LCS is implemented in pure haskell 98, and is of signature
lcs :: Ord a => [a] -> [a] -> [a]
The returned list is proportional to the length of the smaller of the input lists, which is, in general, quite large compared with the size of the patch. It's probably more efficient to produce a list of edits: data Source = LeftFile | RightFile diff :: (Ord a) => [a] -> [a] -> [(SourceFile, Int)] Where each tuple represents a line which has no correspondence in the other file. If anyone is curious, I did a pretty complete implementation of diff for Mercury: http://www.mercury.cs.mu.oz.au/ It should be fairly directly translatable into Haskell if someone doesn't feel like thinking. Cheers, Andrew Bromage
participants (2)
-
ajb@spamcop.net
-
David Roundy via RT