
Greetings, My new regular expression backend "regex-tdfa" is passing all the tests I have been throwing at it, so it is time to share the good news. First up: Where is it? Version 0.56 is in my "stable" location at
darcs get --partial darcs.haskell.org:/home/darcs/packages/regex-tdfa The version that will get updated and broken more often is "unstable" at darcs get --partial darcs.haskell.org:/home/darcs/packages/regex-unstable/regex-tdfa
It needs a slightly updated regex-base package (version >= 0.80) at
darcs get --partial darcs.haskell.org:/home/darcs/packages/regex-unstable/regex-base
It has only been tested (on Max OS X) with GHC 6.6 and depends on base >= 2.0 for Data.ByteString, Data.ByteString.Lazy, and Data.Sequence. ** What does regex-tdfa do? It runs POSIX regular expression searches correctly. No other regular expression backend for Haskell does this correctly. The regular expression can be chosen to be from a String, ByteString, ByteString.Lazy, or (Data.Seq Char). Independent of that choice the searched text can also be any of those types. ** Properties and benefits of regex-tdfa: It follows a known standard: POSIX. (If you find that it disagrees with the standard then I will fix it.) It is written in Haskell, so there is no C library to compile and install. Though it is currently only written for GHC 6.6, it could be ported to any other compiler with only minor changes. (I expect that if you drop the polymorphic MPTC interface of regex-base it could be made Haskell98). It works with Unicode since it can be used on [Char] (and it works if there are '\NUL' characters around). It is lazy in the regular expression: it only builds the DFA if you use it, and only builds the state it needs, which are memoized in a lazy Trie. It is lazy in the searched text, and the pure function that searches the text only evaluates the text until the next match is done and does not hold onto the initial text. Thus you could efficiently search a [Char] up to the size of the Position type, which is an Int for now. It is free, the regex-tdfa library is under a BSD-3 license. ** Does it work? Yes. The regex-posix backend that come with GHC does not run all the searches correctly: it is buggy (example at end of announcement). And the regex-posix backend is also impossibly slow -- the slowest backend by far -- orders of magnitude slower. The regex-tre backend that wraps the TRE library has exposed this very fast library to the harsh gaze of QuickCheck. The lesson from QuickCheck is to never use regex-tre because the TRE library (version 0.7.4) is buggy. For regex-tdfa backend I took/stole ideas from Ville Laurikari's Master thesis which describes the algorithm that TRE uses. And I made a pure Haskell implementation that now passes all the tests at http://www.research.att.com/~gsf/testregex/testregex.html and follows the notes at http://www.research.att.com/~gsf/testregex/re-interpretation.html and passes those tests as well. And it "passes" QuickCheck tests which actually just compare the answers that regex-posix, regex-tre, and regex-tdfa since I do not have a bug free reference implementation to compare against. ** Why the name TDFA? Tagged Deterministic Finite Automata. The Tagged part is from Ville Laurikari's thesis work. ** Speed: If you do not care only about the match as a whole and not about subexpression matching then you can set the captureGroups ExecOption to False: makeMyRegex = makeRegex defaultCompOpt myExecOpt where myExecOpt = defaultExecOpt { captureGroups = False } And redefine (=~) and (=~~) to use makeMyRegex instead of makeRegex: (=~) :: (RegexMaker Regex CompOption ExecOption source,RegexContext Regex source1 target) => source1 -> source -> target (=~) s r = let re = makeMyRegex r :: Regex in match re s (=~~) :: (RegexMaker Regex CompOption ExecOption source,RegexContext Regex source1 target,Monad m)=> source1 -> source -> m target (=~~) s r = let re = makeMyRegex r :: Regex in match re s The benefit to this is speed, as it then proceeds to run all matches of "b(..?c)?d" against a large test file *faster* than the the TRE C library backend. My compliments to those who worked on ghc and -O2. It even runs nearly as fast as regex-dfa in this mode. Thus regex-dfa is (mostly) obsolete. If you use the default and capture the subgroups then the speed is reduced by a factor of about 4. This is because of the very simple data structures that keep track of the run time state. I intend to improve this aspect of the design now that I have a working algorithm. I have not benchmarked the speed of compiling the regular expression from the String. This compiling is done with very lazy data structures, so the DFA size is really only as big as you actually explore while running the match. ** Examples of bugs in regex-posix and regex-tre: An example for regex-posix:
*Main> "ab" Posix.=~ "((a)|(b)){2,}" :: MatchArray array (0,3) [(0,(0,2)),(1,(1,1)),(2,(0,1)),(3,(1,1))
The array holds (group #, (match offset, match length)) The 0th group is the whole match "ab". The 2nd group is (a) and it matched the first "a" and the third group is (b) and matched the "b". But the second group is a child of the first group ((a)|(b)) which matched "b". The standard requires that captured groups are nested inside their parents, and here "a" is not not nested inside "b". So the regex-posix C backend that comes with GHC is buggy. The correct answer is, where the second group matched nothing:
*Main> "ab" TDFA.=~ "((a)|(b)){2,}" :: MatchArray array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(1,1))]
An example for regex-tre: "yyyyyy" TRE.=~ "(yyy|(x?)){2,4}" array (0,2) [(0,("yyyyyy",(0,6))),(1,("yyy",(3,3))),(2,("yyy",(3,3)))],"") Here TRE reports that the second group (x?) which should only match "" or "x" somehow matched "yyy". This is even more wrong than the regex-posix bug above. The correct answer, where the second group does not match: array (0,2) [(0,("yyyyyy",(0,6))),(1,("yyy",(3,3))),(2,("",(-1,0)))],"") -- Chris Kuklewicz
participants (1)
-
Chris Kuklewicz