Policy change for regex libraries

12 Jan 2009

      Before I go and add or change anything in the Haskell regex-posix library, I 
wanted to get some feedback.  regex-posix provides Text.Regex.Posix, and is 
built on the regex-base package.

regex-posix is also used as the backend for Text.Regex (the regex-compat package 
does this).  I do not intend to change the behavior of the old Text.Regex API.

The main issue is the behavior when returning a list of all matches of a Regex 
against a target text.  I no longer think the current behavior is the right 
choice when it comes to zero-length matches.

The current behavior is to return non-overlapping matches with the caveat that 
after the first zero-length match the search is ended.  Note that the 
zero-length match may be occur at the end position of a previous non-zero-length 
match.

Notably, no one has complained about this policy.  But I no longer like it.  So 
here are a few of my ideas of what to change it to:

0) No change, not worth the effort.

1) return the zero-length match, skip forward 1 character, and continue 
searching.  If the consumer wishes the old policy they can truncate the list. 
This could also be filtered to resemble option 2 below.

2) Mimic "sed".  It seems "sed" has a policy where a zero-length match is 
forbidden to occur at the end position of a non-zero-length match.  "sed" does 
not stop with the first zero-length match.

3) implement additional execution options, so the user can choose a policy.  The 
default policy choice left with the current behavior.

4) implement additional execution options, so the user can choose a policy.  The 
default policy choice set to he behavior in (1).

5) Return valid matches starting from all positions, including overlapping 
matches.  This I really do not like and one can run the search starting one 
character after the start of the last match to get this information.

Matching "0123" and replacing all matches with themselves wrapped in angle 
brackets.  The policies of 0, 1, and 2 above lead to (computed partly by hand):

regex of "[0123]?"
0): "<0><1><2><3><>"
1): "<0><1><2><3><>"
2): "<0><1><2><3>"

regex of "[012]?"
0): "<0><1><2>3<>"
1): "<0><1><2>3<>"
2): "<0><1><2>3<>"

regex of "[013]?"
0): "<0><1><>23"
1): "<0><1><>2<3><>"
2): "<0><1>2<3>"

regex of "[023]?"
0): "<0><>123"
1): "<0><>1<2><3><>"
2): "<0>1<2><3>"

regex of "[123]?"
0): "<>0123"
1): "<>0<1><2><3><>"
2): "<>0<1><2><3>"

regex of "[03]?"
0): "<0><>123"
1): "<0><>1<>2<3><>"
2): "<0>1<>2<3>"

regex of "[03]?"
0): "<0><>123"
1): "<0><>1<>2<3><>"
2): "<0>1<>2<3>"

regex of "[12]?"
0): "<>0123"
1): "<>0<1><2><>3<>"
2): "<>0<1><2>3<>"

I am leaning to simply changing it from policy 0 to policy 1.

Are there any objections?

Perhaps I should set a deadline? Now where is that library policy...

-- 
Chris

Chris Kuklewicz

Don Stewart

mail＠justinbogner.com

Chris Kuklewicz

mail＠justinbogner.com

Chris Kuklewicz

mail＠justinbogner.com

tags

participants (3)