
Maybe I underestimated the utility of ^ and $. The definition seems intricate. I thought about adding a combinator for matching newline but now think that would lead to wrong start and end positions. For example the start position of the matching substring for ^a in "a\na" should be 2 not 1, right? Or is it 0 although there is no newline at the beginning?
The first "a" would match with indexes (0,1) and the second "a" would match with indexes (2,3), since the newline itself occupies (1,2).
Is there a page with examples that show how ^ and $ should behave exactly?
Without REG_NEWLINE the meanings are:

  .  matches any single character (though note that a zero byte cannot be handled in C-style strings, for a different reason).
  ^  is an assertion that, instead of being AlwaysTrue (eps) or AlwaysFalse (noMatch), is true before any characters have been accepted and false afterward.
  $  is an assertion that is true only when there are no more characters to match and false before this.

With REG_NEWLINE the meanings are:

  .  matches any single character EXCEPT '\n' newline (ASCII 10).
  ^  is true before any characters have been matched and true right after a newline has been matched, else false.
  $  is true when there are no more characters to match and true if the next character to match is a newline, else false.

Let 'a' and 'b' and 'c' be some complicated regular expressions that cannot accept a newline with REG_NEWLINE enabled:

^$ finds blank lines: the indexes between newlines, or between a newline and the start or end of the text.

^a$ requires 'a' to exactly fill a line, and the captured string has no newlines.

A more complicated use, perhaps as part of a crazy parser: "(a(\n)?)(^|b)(c|$)" has 'a' match some text and perhaps the newline. If the newline was there then the ^ matches and b might be skipped, otherwise b must be used. The final '(c|$)' thus either starts the new line or trails b, and it can avoid matching 'c' if the next character is a newline.

Note that the regular expression "(^|[aA])" has a non-trivial "can_accept_empty" property: it can only sometimes accept the empty string. And if you are recording parenthetical captures then "(^)?" is subtle. When ^ is true the (^) succeeds like () and when it is false it does not. This inserts a test into the pattern that can be checked later. And "((^$)|(^)|($))" is worse: it does not always succeed, and which sub-pattern gets captured depends on the presence of one or two newlines.
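The two sets of meanings above can be written down as plain predicates on an offset into the text. The following is only a minimal sketch in Python, not code from regex-tdfa or from any POSIX library; the names caret, dollar, offsets, blank_lines and caret_a_spans are made up for this illustration, and the newline flag stands in for REG_NEWLINE.

```python
def caret(text: str, pos: int, newline: bool) -> bool:
    """^ holds at offset 0; in newline mode it also holds right after a newline."""
    if pos == 0:
        return True
    return newline and text[pos - 1] == "\n"


def dollar(text: str, pos: int, newline: bool) -> bool:
    """$ holds at the end of the text; in newline mode also right before a newline."""
    if pos == len(text):
        return True
    return newline and text[pos] == "\n"


def offsets(assertion, text: str, newline: bool) -> list:
    """All offsets 0..len(text) at which the given assertion holds."""
    return [p for p in range(len(text) + 1) if assertion(text, p, newline)]


def blank_lines(text: str) -> list:
    """Offsets where ^$ matches in newline mode: both hold, nothing is consumed."""
    return [p for p in range(len(text) + 1)
            if caret(text, p, True) and dollar(text, p, True)]


def caret_a_spans(text: str, newline: bool) -> list:
    """(start, end) spans where ^ holds and a literal 'a' follows."""
    return [(p, p + 1) for p in offsets(caret, text, newline)
            if text[p:p + 1] == "a"]


print(offsets(caret, "a\na", False))   # [0]
print(offsets(caret, "a\na", True))    # [0, 2]
print(offsets(dollar, "a\na", False))  # [3]
print(offsets(dollar, "a\na", True))   # [1, 3]
print(caret_a_spans("a\na", True))     # [(0, 1), (2, 3)]
print(blank_lines("a\n\nb\n"))         # [2, 5]
```

Offsets count every character, including the newline, so in "a\na" the ^ assertion is true at offset 2 and not at offset 1. In "a\n\nb\n" the pattern ^$ matches at offset 2 (between the two newlines) and at offset 5 (between the last newline and the end of the text).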
In "((^)|(^$))" it is impossible for (^$) to be used since the first (^) will always be favored by the POSIX rules. Similarly "(()|(^))" will never use (^). A small chunk of regex-tdfa sifts through the possible ways to accept 0 characters for each node in the parse tree, keeps an ordered list of sets of assertions to check, and cleans out those that are logically excluded.

Slightly more useful anchors are added in Perl/PCRE:
ANCHORS AND SIMPLE ASSERTIONS

  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
I added \b \B as above, added \` \' to be like \A and \Z above, and added \< and \> to be beginning-of-word and end-of-word assertions. With enough assertions and negated assertions one could move up to using a binary decision diagram to express when a sub-pattern can accept 0 characters.

Ville's libtre gets this wrong:
Searched text: "searchme"
Regex pattern: "((s^)|(s)|(^)|($)|(^.))*"
Expected output: "(0,1)(0,1)(-1,-1)(0,1)(-1,-1)(-1,-1)(-1,-1)"
Actual result : "(0,1)(0,1)(-1,-1)(0,1)(1,1)(1,1)(-1,-1)"

And sometimes very wrong:

Searched text: "searchme"
Regex pattern: "s(^|())e"
Expected output: "(0,2)(1,1)(1,1)"
Actual result : "NOMATCH"
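The bookkeeping for zero-width alternatives mentioned earlier (an ordered list of assertion sets, with the logically excluded ones removed) can be sketched roughly as below. This is a minimal Python illustration and not regex-tdfa's actual code: the names Way, prune, holds and first_usable are hypothetical, only ^ and $ with REG_NEWLINE-style meanings are modelled, and "logically excluded" is read here as "an earlier alternative needs only a subset of the same assertions", which is an assumption.

```python
from typing import FrozenSet, List, Optional

# One way to accept zero characters: the set of assertions it must pass.
Way = FrozenSet[str]


def prune(ways: List[Way]) -> List[Way]:
    """Drop ways that can never be chosen: if an earlier way needs only a
    subset of the assertions, it is favored whenever the later one holds."""
    kept: List[Way] = []
    for way in ways:
        if not any(earlier <= way for earlier in kept):
            kept.append(way)
    return kept


def holds(assertion: str, text: str, pos: int) -> bool:
    """Evaluate ^ or $ at an offset, with REG_NEWLINE-style meanings."""
    if assertion == "^":
        return pos == 0 or text[pos - 1] == "\n"
    if assertion == "$":
        return pos == len(text) or text[pos] == "\n"
    raise ValueError("unknown assertion: " + assertion)


def first_usable(ways: List[Way], text: str, pos: int) -> Optional[int]:
    """Index of the first way whose assertions all hold at pos, else None."""
    for index, way in enumerate(ways):
        if all(holds(a, text, pos) for a in way):
            return index
    return None


CARET, DOLLAR = "^", "$"

# "((^)|(^$))": the second alternative is excluded by the first.
print(prune([frozenset({CARET}), frozenset({CARET, DOLLAR})]) == [frozenset({CARET})])  # True

# "(()|(^))": () needs no assertions, so (^) is never used.
print(prune([frozenset(), frozenset({CARET})]) == [frozenset()])  # True

# "(^|())" inside "s(^|())e" on "searchme": at offset 1 the ^ fails,
# so the empty group (index 1) is the one that accepts.
print(first_usable([frozenset({CARET}), frozenset()], "searchme", 1))  # 1

# "((^$)|(^)|($))" does not always succeed: nothing holds inside "ab".
print(first_usable([frozenset({CARET, DOLLAR}), frozenset({CARET}),
                    frozenset({DOLLAR})], "ab", 1))  # None
```

Under these assumptions the second libtre case has a valid zero-width choice at offset 1, namely the empty group, which agrees with the expected captures (1,1)(1,1) rather than NOMATCH.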
Cheers,
Dr. Chris Kuklewicz