
As there are both a fullMatch and a partialMatch function, I don't see an immediate need for anchors, although I admit that they have the advantage that you can specify *in the regexp* whether you want full or partial matching.
The REG_NEWLINE flag for compiling POSIX regular expressions is defined as:
REG_NEWLINE Compile for newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning in either REs or strings. With this flag, `[^' bracket expressions and `.' never match newline, a `^' anchor matches the null string after any newline in the string in addition to its normal function, and the `$' anchor matches the null string before any newline in the string in addition to its normal function.
Using ^ and $ to match near newlines makes matching patterns that cross lines much easier to control. But anchors are also a source of bugs for some implementations, e.g. libtre.
I am not sure whether it is a good idea to implement submatch capture only by means of a different semiring. I would not hesitate to extend the matching algorithm itself, if that leads to a clean implementation.
The "tag" idea that I copy from Ville's thesis is that each NFA state holds a history. This history is your expression tree with each node (i.e. sub-pattern) recording the most recent input index where the sub-pattern was entered and when it was exited. During matching two such histories will be "added" if they arrive at the same NFA state, where "adding" compares the sub-patterns in priority order to choose which of the two histories is kept and which is discarded. The problem is keeping the histories bounded and the answer correct. Making this work efficiently (linear in time, constant in space) and match POSIX substrings has been hard. Doing things like expanding /a{2,5}/ helps. The thing that makes constant space hard is that /a*/ cannot be expanded as the blow-up is unbounded. Expanding during execution is still linear in space: one needs to record the length of each iteration of the sub-pattern /a/ in an unbounded list. Ville's libtre cheats and gets the wrong answer by using only the length of the final iteration. With only the last length one can only maximize (or minimize) the last iteration of /a*/. The re-interpretation specification makes it clear that the length of each iteration from left to right ought to be greedy, constrained by the overall left-long match. Thus one _needs_ to use the length of _all_ iterations to be sure of getting the correct POSIX substring match for the last iteration. The original regex-tdfa took linear space and got the right answer. The current regex-tdfa does a clever thing to compress the list of lengths and thus runs constant space. The clever trick while matching is, schematically: 1) Ever 100 characters pause and perform a "compression", which is 2) Go though the NFA states and sort the lists of repetition lengths 3) Replace the lists of repetition lengths with a single integer which is the ordinal position found by the sorting 4) Resume matching, building up the lists again for the next 100 characters...goto step 1 Thus the lists are bounded at 100 entries at the cost of breaking the abstraction that the NFA "marks" only interact with each other when they collide. The compression step pre-sorts the "mark" histories to decide the result of a possible future collision. It is implemented as a time/space trade-off. A more elegant re-implementation might do away with the separate NFA "mark" abstraction and keep them always sorted...
I am currently rewriting the
algorithm in OCaml (to learn OCaml) and I am trying to avoid the /a{2,5}/ blowup that regex-tdfa has.
Did you experience problems with such blowup? The re2 library also implements bounded repetitions by blowing them up, so, at least, it doesn't seem to be a big problem for Google.
Blowup in regex-tdfa is a problem for large blowups. The constant bounded history space required for a length 'm' expression is O(m^2) because of the substring capturing information. And I use a lazy DFA so I am vulnerable to making a graph of size 2^m until the whole expression gets garbage collected. I am impressed by how your lazy Haskell code handles infinite 'm'! The OCaml re-implementation of regex-tdfa should fight this space blow-up by using an one-the-fly DFA like yours and by not expanding /a+/ /a{i,}/ and /a{i,k}/ expressions. But this is unfinished. Cheers, Dr. Chris Kuklewicz