major speed improvement: regex-tdfa reaches 1.0.0

Announcing the version 1.0.0 release of regex-tdfa. I am proud of this release. This is not just a bug fix release: it is a serious improvement in the asymptotic running time. Previous versions allowed bad combinations of pattern and searched-text length to scale badly in the length of the text; the worst case for text of length N was O(N^3). The new worst-case asymptotic runtime scales as O(N). There is never any backtracking, and the worst-case storage space is independent of N.

The package is on Hackage at http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa and the darcs repository is at http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/

All non-overlapping matches are found and returned, along with all captured (parenthesized) subexpressions. The result is precisely what POSIX extended regular expressions are supposed to return.

To be concrete, consider example text of length N = (2^n) + 2:
longInput = replicate (2^n) 'a' ++ "bc"
And define 4 worst-case-scenario patterns. I wrote the code and I know how to kill it:

wcs0 = "a*b"
wcs1 = "a*c"
wcs2 = "(a|aa)*b"
wcs3 = "(a|aa)*c"
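For readers who want to reproduce the setup, here is a minimal sketch of running one of these patterns with regex-tdfa's (=~) operator. This is not the Test-TDFA benchmark driver used for the timings below (that driver is not shown in this post); it only assumes the regex-tdfa package is installed.

```haskell
-- Minimal sketch: run the wcs3 pattern over longInput with regex-tdfa.
-- Hypothetical standalone program, not the benchmark driver from the post.
import Text.Regex.TDFA ((=~))

main :: IO ()
main = do
  let n = 10 :: Int
      longInput = replicate (2 ^ n) 'a' ++ "bc"
  -- Ask for the (MatchOffset, MatchLength) of the leftmost-longest
  -- POSIX match. Because the 'b' separates the run of 'a's from the
  -- final 'c', the only match of "(a|aa)*c" is the lone 'c' at the end,
  -- which agrees with the (1025,1) entries in the arrays quoted below.
  print (longInput =~ "(a|aa)*c" :: (Int, Int))
```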
wcs0 is easy. wcs1 causes the old code to backtrack. wcs2 causes the old code's storage to scale as O(N). wcs3 causes both backtracking and O(N) storage with the old code.

The old code's time scales as O(N) for wcs0, O(N^2) for wcs1 and wcs2, and O(N^3) for wcs3. The new code is always O(N). The actual timings for the old code on my G4 laptop for wcs3 on 2^8, 2^9, and 2^10 are:

  Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 8 +RTS -sstderr 2>&1 | head -4
  ./Test-TDFA-np wcs3 8 +RTS -sstderr
  Test for [Char]
  Right [array (0,1) [(0,(257,1)),(1,(-1,0))]]
  263,776,756 bytes allocated in the heap
  user  0m1.017s
  sys   0m0.058s

  Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 9 +RTS -sstderr 2>&1 | head -4
  ./Test-TDFA-np wcs3 9 +RTS -sstderr
  Test for [Char]
  Right [array (0,1) [(0,(513,1)),(1,(-1,0))]]
  1,998,647,452 bytes allocated in the heap
  user  0m7.083s
  sys   0m0.289s

  Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 10 +RTS -sstderr 2>&1 | head -4
  ./Test-TDFA-np wcs3 10 +RTS -sstderr
  Test for [Char]
  Right [array (0,1) [(0,(1025,1)),(1,(-1,0))]]
  15,653,277,200 bytes allocated in the heap
  user  0m53.350s
  sys   0m2.056s

Each doubling of the length causes a nearly eight-fold increase in time, and the heap allocation also grows nearly eight-fold.
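The "nearly eight-fold" claim can be sanity-checked directly from the user times quoted above (a throwaway sketch using only those numbers):

```haskell
-- Check the O(N^3) scaling: doubling N should multiply the running time
-- by roughly 2^3 = 8. Figures are the quoted user times for n = 8, 9, 10.
main :: IO ()
main = do
  let userTimes = [1.017, 7.083, 53.350] :: [Double]
      ratios = zipWith (/) (tail userTimes) userTimes
  -- Both ratios come out close to 8 (about 7.0 and 7.5).
  mapM_ print ratios
```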
The new code with the same input pattern and texts gives:

  Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 8 +RTS -sstderr 2>&1 | head -4
  ./Test-TDFA2-0.99.19-np2 wcs3 8 +RTS -sstderr
  Test for [Char]
  Right [array (0,1) [(0,(257,1)),(1,(-1,0))]]
  2,135,324 bytes allocated in the heap
  user  0m0.017s
  sys   0m0.017s

  Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 9 +RTS -sstderr 2>&1 | head -4
  ./Test-TDFA2-0.99.19-np2 wcs3 9 +RTS -sstderr
  Test for [Char]
  Right [array (0,1) [(0,(513,1)),(1,(-1,0))]]
  3,588,656 bytes allocated in the heap
  user  0m0.024s
  sys   0m0.017s

  Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 10 +RTS -sstderr 2>&1 | head -4
  ./Test-TDFA2-0.99.19-np2 wcs3 10 +RTS -sstderr
  Test for [Char]
  Right [array (0,1) [(0,(1025,1)),(1,(-1,0))]]
  6,345,436 bytes allocated in the heap
  user  0m0.038s
  sys   0m0.018s

Note that the heap allocation for the 1026-character example above is 2466 times less than the old code's. That was too fast to prove the scaling, so take more input:

  Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 20 +RTS -sstderr 2>&1 | head -4
  ./Test-TDFA2-0.99.19-np2 wcs3 20 +RTS -sstderr
  Test for [Char]
  Right [array (0,1) [(0,(1048577,1)),(1,(-1,0))]]
  5,708,574,848 bytes allocated in the heap
  user  0m26.023s
  sys   0m0.985s

  Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 21 +RTS -sstderr 2>&1 | head -4
  ./Test-TDFA2-0.99.19-np2 wcs3 21 +RTS -sstderr
  Test for [Char]
  Right [array (0,1) [(0,(2097153,1)),(1,(-1,0))]]
  11,416,354,056 bytes allocated in the heap
  user  0m52.656s
  sys   0m1.985s

The length and time both doubled, as did the heap allocation. And the new code has searched two million characters in the time the old code searched one thousand.

How about away from the worst case scenario?
On the testing suite the new code is running slightly slower:

  Reason:compare-tdfa chrisk$ time ./Test-TDFA-np -r 1 100 > /dev/null
  user  0m4.841s
  sys   0m3.019s

  Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 -r 1 100 > /dev/null
  user  0m5.970s
  sys   0m3.012s

So that is an increase of execution time of 14%. This small dip in performance might be reclaimable with more optimization. I think the gain in worst-case performance already offsets the slight cost.

The code for String is complete. The strict and lazy ByteStrings and (Seq Char) currently use the String code for matching. This will be improved in a future release.

Cheers,
Chris

The small print: regex-tdfa will still behave badly in space and time if given a pathological pattern, e.g. "(((a|aa){0,100}){0,100}){0,100}". But I am hopeful that I can improve regex-tdfa to behave well with patterns like this one. That is my vague goal for a future version 2.0.0 release.

The very small print: I have been using ghc-6.10.1, and I think I have carried over cabal switches to make ghc-6.8.3 also work. The library probably can work with ghc-6.6, but the cabal file will need editing first, as well as fixes to "Text.Regex.TDFA.NewDFA.copySTU".

ChrisK wrote:
The previous versions allowed bad combinations of pattern and searched text length to scale badly in the length of the text. Previously the worst case for text of length N was O(N^3).
The new worst-case asymptotic runtime scales as O(N). There is never any backtracking. And the worst-case storage space is independent of N.
That's an awesome result. :-) Thanks! Martijn.

Congratulations! This is really cool.
And this also made me find a bug in GHC :)
http://hackage.haskell.org/trac/ghc/ticket/3059
2009/3/2 ChrisK
Announcing the version 1.0.0 release of regex-tdfa.
[...]
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Eugene Kirpichov Web IR developer, market.yandex.ru
participants (3):

- ChrisK
- Eugene Kirpichov
- Martijn van Steenbergen