ANN: regex-* updates and regex-tdfa 0.20 first release

Greetings. I have several bits of news about regex-* packages. First: I have fixed a small bug in one instance of RegexContext, defined in Text.Regex.Contex in the regex-base class (included in GHC). The new version of regex-base is 0.80 and is available at http://darcs.haskell.org/packages/regex-unstable/regex-base/. The instance fixed was
instance (RegexLike a b) => RegexContext a b (MatchResult b) where and I doubt this affects anyone at the moment.
Second: There is also a new version of regex-tre at http://darcs.haskell.org/packages/regex-unstable/regex-tre/ (version 0.81) that exposes the 'compRightAssoc' option when making a Regex (see Text.Regex.TRE.Wrap for all the options). Third: I wish to note that that the libtre backend (up to the current version 0.7.5) that regex-tre uses is buggy and that regex-tre / libtre should only be used where testing shows that no bugs are being trigged. (Why go to that much trouble? Because the libtre backend is very very fast). Fourth: The fabulous new all Haskell POSIX compatible backend! This is a first announcement of regex-tdfa, my "tagged" DFA regular expression backend. It is an implementation of regular expressions in Haskell; it does not hand off the data to a library written in c. The ultimate goal of regex-tdfa is to replace regex-posix as the default engine that come with GHC. Why go to the trouble of replacing regex-posix? Because regex-posix is very very slow (and not quite POSIX, as seen at the end of this message). The new regex-tdfa does not use the slow backtracking method like my pure Haskell backend regex-posix. It uses a deterministic finite automata, similar in spirit to one I derived from CTKlight for use in the pure Haskell regex-dfa backend (which is LGPL -- the rest are BSD-3 license). But regex-dfa is fast at a cost of not returning any parenthesized subexpression captures (i.e only \0 and not \1 \2 \3 ...). The regex-tre backend (BSD3) that uses the libtre c library (LGPL) employs a special "tagged" dfa technique to get subexpression captures while still using an efficient DFA. The regex-posix engine that comes with GHC is impossibly slow but regex-tre is very very fast and (claims to also be) a POSIX engine. I was thus inspired to create regex-tdfa which aims to create a pure Haskell variant of this tagged-DFA algorithm under a simple BSD3 license. The 0.20 version of regex-tdfa that I am announcing is available at http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/ and requires the updated regex-base (version 0.80 or greater) and is setup to compile for GHC 6.6. These could be back-ported to GHC 6.4 by adding the fps package to GHC 6.4 to get Data.ByteString and removing references to Data.Sequence in the new regex-base and regex-tdfa packages. As of now, my first untuned version of the regex-tdfa backend performs about 2-3 times slower than regex-tre and the libtre c library. This also means regex-tdfa performs 3+ times faster than the regex-parsec engine. So the goal of a fast replacement for regex-posix is close to hand. The bad news: libtre has two problems, the first is that it is buggy. I have found patterns that the libtre engine does not handle correctly: the wrong match for the whole expression and/or subexpressions is returned (or it will fail to match when it should have succeeded). The second problem is that it does not always follow the POSIX standard policy in determining which possible parenthesized subexpession captures to return when there are several possibilities. My regex-tdfa engine does not have the bugs that I have found in libtre. But since it uses the same design concepts as libtre it had the same problem with not following the POSIX standard policy for capturing subexpressions. To fix this I have exteneded the tagging technique to keep enough information to follow the POSIX spec. So of all the regex-* backends, the regex-tdfa is the only one that passes all the tests that I have for POSIX compatibility. End note: The regex-posix engine, despite its name, does not follow the POSIX standard policy for capturing subexpressions. Example 1 of 2: "ab" =~ "((a?)((ab)?))(b?)" This should capture the sub ranges (0,2)(0,2)(0,0)(0,2)(0,2)(2,2) But it does capture the sub ranges (0,2)(0,1)(0,1)(1,1)(-1,-1)(1,2) The first group is (0,2) means the whole match was \0=="ab" The second group (0,2) or (0,1) means the first group "((a?)((ab)?))" should match \1=="ab" but it returns \1=="a". This first group should have been chosen for maximum length and the failure to do so is a policy violation. Example 2 of 2: "x" =~ "(.?)*" This should capture: (0,1)(0,1) But it does capture: (0,1)(1,1) The first group of (0,1) means the regex did match the whole \0=="x". But the regex-posix engine matched an extra iteration of the "(.?)" under the "*" operator against the empty string after the "x" and returned the empty capture of (1,1). This highly non-useful behavior violates POSIX policy. My regex-parsec engine follows a non-POSIX policy, in which it maximizes the length of just the captured subexpressions instead of all the subpatterns of the regular expression. This is useful and was easy to write and is easy to explain and agrees with the policy of other libraries such as Boost.Regex for C++ ( as in the third FAQ at http://www.boost.org/libs/regex/doc/faq.html ) But to replace regex-posix I need a fast POSIX policy engine, which is what regex-tnfa has now achieved. I will submit regex-tdfa to replace regex-posix once I have a stable version that has been further cleaned up and tuned. Anyone who wishes a useful performance time/spacce tuning challenge is free to help. If you have any evil or stressful examples of regular expressions, then I welcome adding them to my tests. -- Chris Kuklewicz
participants (1)
-
Chris Kuklewicz