[Haskell-cafe] Re: ANN: weighted-regexp-0.1.0.0

28 Jul 2010

      ...
As there are both a fullMatch and a partialMatch function, I don't
see an immediate need for anchors, although I admit that they have
the advantage that you can specify *in the regexp* whether you want
full or partial matching.
The REG_NEWLINE flag for compiling POSIX regular expressions is defined as:
...
REG_NEWLINE   Compile for newline-sensitive matching.  By default,
newline is a completely ordinary character with no special meaning in
either REs or strings.  With this flag, `[^' bracket expressions and
`.' never match newline, a `^' anchor matches the null string after
any newline in the string in addition to its normal function, and the
`$' anchor matches the null string before any newline in the string
in addition to its normal function.
Using ^ and $ to match near newlines makes matching patterns that cross
lines much easier to control.  But anchors are also a source of bugs for
some implementations, e.g. libtre.
...
I am not sure whether it is a good idea to implement submatch capture  
only by means of a different semiring. I would not hesitate to extend  
the matching algorithm itself, if that leads to a clean implementation.
The "tag" idea that I copy from Ville's thesis is that each NFA state
holds a history.  This history is your expression tree with each node
(i.e. sub-pattern) recording the most recent input index where the
sub-pattern was entered and when it was exited.  During matching two
such histories will be "added" if they arrive at the same NFA state,
where "adding" compares the sub-patterns in priority order to choose
which of the two histories is kept and which is discarded.

The problem is keeping the histories bounded and the answer correct.
Making this work efficiently (linear in time, constant in space) and
match POSIX substrings has been hard.

Doing things like expanding /a{2,5}/ helps.  The thing that makes
constant space hard is that /a*/ cannot be expanded as the blow-up is
unbounded.  Expanding during execution is still linear in space: one
needs to record the length of each iteration of the sub-pattern /a/ in
an unbounded list.  Ville's libtre cheats and gets the wrong answer by
using only the length of the final iteration.  With only the last length
one can only maximize (or minimize) the last iteration of /a*/.  The
re-interpretation specification makes it clear that the length of each
iteration from left to right ought to be greedy, constrained by the
overall left-long match.  Thus one _needs_ to use the length of _all_
iterations to be sure of getting the correct POSIX substring match for
the last iteration.

The original regex-tdfa took linear space and got the right answer.  The
current regex-tdfa does a clever thing to compress the list of lengths
and thus runs constant space.  The clever trick while matching is,
schematically:
1) Ever 100 characters pause and perform a "compression", which is
2) Go though the NFA states and sort the lists of repetition lengths
3) Replace the lists of repetition lengths with a single integer which
is the ordinal position found by the sorting
4) Resume matching, building up the lists again for the next 100
characters...goto step 1

Thus the lists are bounded at 100 entries at the cost of breaking the
abstraction that the NFA "marks" only interact with each other when they
collide.  The compression step pre-sorts the "mark" histories to decide
the result of a possible future collision.  It is implemented as a
time/space trade-off.  A more elegant re-implementation might do away
with the separate NFA "mark" abstraction and keep them always sorted...
...
I am currently rewriting the
...
algorithm in OCaml (to learn OCaml) and I am trying to avoid the
/a{2,5}/ blowup that regex-tdfa has.
Did you experience problems with such blowup? The re2 library also  
implements bounded repetitions by blowing them up, so, at least, it  
doesn't seem to be a big problem for Google.
Blowup in regex-tdfa is a problem for large blowups.  The constant
bounded history space required for a length 'm' expression is O(m^2)
because of the substring capturing information.  And I use a lazy DFA so
I am vulnerable to making a graph of size 2^m until the whole expression
gets garbage collected.

I am impressed by how your lazy Haskell code handles infinite 'm'!

The OCaml re-implementation of regex-tdfa should fight this space
blow-up by using an one-the-fly DFA like yours and by not expanding /a+/
/a{i,}/ and /a{i,k}/ expressions.  But this is unfinished.

Cheers,
  Dr. Chris Kuklewicz