
G'day all. On Wed, Oct 09, 2002 at 02:29:26PM -0400, David Roundy wrote:
I get a speedup of about a factor of 6 for the test files I was using (of course, this would depend on file size), and find that now only 2% of my time is spent in that function. I'm still something like 100 times slower than GNU diff, so I think (hope, certainly!) there's still room for more improvement (another day). I'm sure I'll have more questions in a few days, once I've tracked down what the new bottlenecks are.
If you understand logic languages, you might want to look at the diff implementation which I wrote for the Mercury distribution: http://www.cs.mu.oz.au/research/mercury/ It's pretty close to GNU diff in speed. In fact, it was indistinguishable on every test case I could think of. There are, two main differences to what I could gather from your implementation. First thing was that I noticed that a lot of time was being spent doing string comparisons. I inserted a pre-pass which mapped strings to (unboxed) integers with the constraint that the integers are equal if and only if the strings are equal. This also turned out to be critical for implementing flags such as --ignore-space-change. The other was I used a different algorithm than you did: Eugene W. Myers. "An O(ND) difference algorithm and its variations." Algorithmica, 1:251-266, 1986. I found it to be much faster than the O(n log n) algorithm, even on cases where you would expect it to perform poorly (i.e. where D is large), partly because the constant factors are really, really low and because in the pre-pass you can optimise the case where you have a number of consecutive lines none of which appear anywhere else, which is typical for most uses of diff. Cheers, Andrew Bromage