[Haskell-cafe] Re: ANN: weighted-regexp-0.1.0.0

28 Jul 2010

      ...
Maybe I underestimated the utility of ^ and $. The definition seems
intricate. I thought about adding a combinator for matching newline but
now think that would lead to wrong start and end positions. For example
the start position of the matching substring for  ^a  in  "a\na"  should
be 2 not 1, right? Or is it 0 although there is no newline at the
beginning?
The first "a" would match with indexes (0,1) and the second "a" would
match with indexes (2,3).
...
Is there a page with examples that show how ^ and $ should behave exactly?
Without REG_NEWLINE the meanings are:

. matches any single character (though note that handling of a zero byte
is impossible for C style strings for a different reason).

^ is an assertion that, instead of being AlwaysTrue (eps) or AlwaysFalse
(noMatch), is true before any characters have been accepted and false
afterward.

$ is an assertion that is true only when there are no more characters to
match and false before this.

With REG_NEWLINE the meanings are:

. matches any single character EXCEPT '\n' newline (ASCII 10, I think).

^ is true before any characters have been matched and true right after a
newline has been matched, else false.

$ is true when there are no more characters to match and true if the
next character to match is a newline, else false.
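
As a rough illustration of the two sets of meanings above, here is a
minimal sketch that models ^ and $ as predicates on a text and an offset
into it. The names Mode, caret, dollar and holdsAt are made up for this
sketch; they are not part of regex-tdfa or weighted-regexp. Offsets are
assumed to lie between 0 and the length of the text.

```haskell
-- Hypothetical model of the anchors; not a real library API.
data Mode = Plain | Newline deriving (Eq, Show)

-- | Is ^ true at offset i?  Valid for 0 <= i <= length text.
caret :: Mode -> String -> Int -> Bool
caret mode text i
  | i == 0    = True
  | otherwise = mode == Newline && text !! (i - 1) == '\n'

-- | Is $ true at offset i?  Valid for 0 <= i <= length text.
dollar :: Mode -> String -> Int -> Bool
dollar mode text i
  | i == length text = True
  | otherwise        = mode == Newline && text !! i == '\n'

-- | All offsets of the text at which the given assertion holds.
holdsAt :: (String -> Int -> Bool) -> String -> [Int]
holdsAt p text = filter (p text) [0 .. length text]
```

For the text "a\na", holdsAt (caret Newline) gives [0,2] and
holdsAt (dollar Newline) gives [1,3]; in Plain mode they give [0] and
[3] respectively.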

Let 'a' and 'b' and 'c' be some complicated regular expressions that
cannot accept a newline with REG_NEWLINE enabled:

^$ finds blank lines, the indexes between newlines or between a newline
and the start or end of the text.

^a$ requires 'a' to exactly fill a line and the captured string has no
newlines.
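
To make the two cases above concrete, here is a small sketch, assuming
REG_NEWLINE semantics, that lists where ^$ can match and which spans a
pattern like ^a$ would have to fill. The function names blankLines and
lineSpans are invented for this example.

```haskell
-- | Offsets where ^$ matches (as an empty match) in newline mode.
blankLines :: String -> [Int]
blankLines text =
  [ i
  | i <- [0 .. n]
  , i == 0 || text !! (i - 1) == '\n'   -- caret holds at i
  , i == n || text !! i == '\n'         -- dollar holds at i
  ]
  where
    n = length text

-- | (start, end) offsets of each line: the spans that 'a' must fill
-- exactly for ^a$ to match.  No span contains a newline.
lineSpans :: String -> [(Int, Int)]
lineSpans text = zip starts ends
  where
    n      = length text
    breaks = [ i | (i, c) <- zip [0 ..] text, c == '\n' ]
    starts = 0 : map (+ 1) breaks
    ends   = breaks ++ [n]
```

For the text "a\n\nb\n", blankLines gives [2,5] and lineSpans gives
[(0,1),(2,2),(3,4),(5,5)]; the blank lines are exactly the spans whose
start equals their end.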

A more complicated use, perhaps as part of a crazy parser:
"(a(\n)?)(^|b)(c|$)" has 'a' match some text and perhaps the newline.  If
the newline was there then the ^ matches and b might be skipped,
otherwise b must be used.  The final '(c|$)' thus either starts the new
line or trails b.  And (c|$) can avoid matching 'c' if the next
character is a newline.

Note that the regular expression "(^|[aA])" has a non-trivial
"can_accept_empty" property: it can sometimes accept empty.

And if you are recording parenthetical captures then "(^)?" is subtle.
When ^ is true the (^) succeeds like () and when it is false it does
not.  This inserts a test into the pattern that can be checked later.

And "((^$)|(^)|($))" is worse: it does not always succeed and which
sub-pattern gets captured depends on the presence of one or two newlines.

In "((^)|(^$))" it is impossible for (^$) to be used since the first (^)
will always be favored by the POSIX rules.

Similarly "(()|(^))" will never use (^).

A small chunk of regex-tdfa sifts through the possible ways to accept 0
characters for each node in the parse-tree and keeps an ordered list of
sets of assertions to check, and cleans out those that are logically
excluded.
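
The idea can be sketched on a toy syntax tree. This is not the
regex-tdfa code: the types and names below (Regex, Conditions,
acceptEmpty, prune) are invented for illustration, preference is
assumed to be simply left to right, and captures and repetition are
ignored. An entry is dropped when an earlier entry is a subset of it,
since the earlier one would then always be chosen first.

```haskell
import Data.List (nub, sort)

-- Hypothetical toy model; the names are illustrative only.
data Assertion = Caret | Dollar deriving (Eq, Ord, Show)

data Regex
  = Empty             -- ()   accepts the empty string unconditionally
  | Test Assertion    -- the anchor ^ or $
  | Chars String      -- one character from a set, e.g. [aA]
  | Seq Regex Regex   -- concatenation
  | Alt Regex Regex   -- alternation, left branch preferred
  | Opt Regex         -- r?   prefers using r over skipping it
  deriving Show

-- A sorted, duplicate-free set of assertions that must all hold.
type Conditions = [Assertion]

-- | Ordered ways to accept zero characters, most preferred first.
acceptEmpty :: Regex -> [Conditions]
acceptEmpty = prune . go
  where
    go Empty     = [[]]
    go (Test a)  = [[a]]
    go (Chars _) = []
    go (Seq r s) = [ sort (nub (x ++ y)) | x <- go r, y <- go s ]
    go (Alt r s) = go r ++ go s
    go (Opt r)   = go r ++ [[]]

-- | Drop every entry implied by an earlier one: it can never be used.
prune :: [Conditions] -> [Conditions]
prune = foldl keep []
  where
    keep kept c
      | any (\k -> all (`elem` c) k) kept = kept
      | otherwise                         = kept ++ [c]
```

With this model "((^)|(^$))" yields [[Caret]] and "(()|(^))" yields
[[]], while "((^$)|(^)|($))" keeps all three entries
[[Caret,Dollar],[Caret],[Dollar]] and "(^|[aA])" yields [[Caret]].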

Slightly more useful anchors are added in Perl/PCRE:
...
ANCHORS AND SIMPLE ASSERTIONS
         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                      also after internal newline in multiline mode
         \A          start of subject
         $           end of subject
                      also before newline at end of subject
                      also before internal newline in multiline mode
         \Z          end of subject
                      also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject
I added \b \B as above, and added \` \' to be like \A and \Z above, and
added  \< and \> to be beginning and end of word assertions.
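
For illustration only, the word assertions can be modelled as predicates
on a text and an offset, in the same spirit as ^ and $. This sketch
assumes that a word character is alphanumeric or an underscore; the
definition actually used by a given library may differ. All function
names are invented here, and offsets are assumed to lie between 0 and
the length of the text.

```haskell
import Data.Char (isAlphaNum)

-- Assumption: word characters are alphanumerics and '_'.
isWordChar :: Char -> Bool
isWordChar c = isAlphaNum c || c == '_'

-- | Is there a word character just before / just after offset i?
wordBefore, wordAfter :: String -> Int -> Bool
wordBefore text i = i > 0 && isWordChar (text !! (i - 1))
wordAfter  text i = i < length text && isWordChar (text !! i)

beginWord, endWord, boundary, notBoundary :: String -> Int -> Bool
beginWord   text i = not (wordBefore text i) && wordAfter text i   -- \<
endWord     text i = wordBefore text i && not (wordAfter text i)   -- \>
boundary    text i = wordBefore text i /= wordAfter text i         -- \b
notBoundary text i = not (boundary text i)                         -- \B
```

On the text "ab cd", beginWord holds at offsets 0 and 3, endWord at 2
and 5, and boundary at all four of those offsets.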

With enough assertions and negated assertions one could level up to
using a binary decision diagram to express when a sub-pattern can accept
0 characters.

Ville's libtre gets this wrong:
...
Searched text: "searchme"
Regex pattern: "((s^)|(s)|(^)|($)|(^.))*"
Expected output: "(0,1)(0,1)(-1,-1)(0,1)(-1,-1)(-1,-1)(-1,-1)"
Actual result  : "(0,1)(0,1)(-1,-1)(0,1)(1,1)(1,1)(-1,-1)"
And sometimes very wrong:
Searched text: "searchme"
Regex pattern: "s(^|())e"
Expected output: "(0,2)(1,1)(1,1)"
Actual result  : "NOMATCH"
Cheers,
  Dr. Chris Kuklewicz