Policy change for regex libraries

Before I go and add or change anything in the Haskell regex-posix library, I wanted to get some feedback. regex-posix provides Text.Regex.Posix, and is built on the regex-base package. regex-posix is also used as the backend for Text.Regex (the regex-compat package does this). I do not intend to change the behavior of the old Text.Regex API. The main issue is the behavior when returning a list of all matches of a Regex against a target text. I no longer think the current behavior is the right choice when it comes to zero-length matches. The current behavior is to return non-overlapping matches with the caveat that after the first zero-length match the search is ended. Note that the zero-length match may be occur at the end position of a previous non-zero-length match. Notably, no one has complained about this policy. But I no longer like it. So here are a few of my ideas of what to change it to: 0) No change, not worth the effort. 1) return the zero-length match, skip forward 1 character, and continue searching. If the consumer wishes the old policy they can truncate the list. This could also be filtered to resemble option 2 below. 2) Mimic "sed". It seems "sed" has a policy where a zero-length match is forbidden to occur at the end position of a non-zero-length match. "sed" does not stop with the first zero-length match. 3) implement additional execution options, so the user can choose a policy. The default policy choice left with the current behavior. 4) implement additional execution options, so the user can choose a policy. The default policy choice set to he behavior in (1). 5) Return valid matches starting from all positions, including overlapping matches. This I really do not like and one can run the search starting one character after the start of the last match to get this information. Matching "0123" and replacing all matches with themselves wrapped in angle brackets. The policies of 0, 1, and 2 above lead to (computed partly by hand): regex of "[0123]?" 0): "<0><1><2><3><>" 1): "<0><1><2><3><>" 2): "<0><1><2><3>" regex of "[012]?" 0): "<0><1><2>3<>" 1): "<0><1><2>3<>" 2): "<0><1><2>3<>" regex of "[013]?" 0): "<0><1><>23" 1): "<0><1><>2<3><>" 2): "<0><1>2<3>" regex of "[023]?" 0): "<0><>123" 1): "<0><>1<2><3><>" 2): "<0>1<2><3>" regex of "[123]?" 0): "<>0123" 1): "<>0<1><2><3><>" 2): "<>0<1><2><3>" regex of "[03]?" 0): "<0><>123" 1): "<0><>1<>2<3><>" 2): "<0>1<>2<3>" regex of "[03]?" 0): "<0><>123" 1): "<0><>1<>2<3><>" 2): "<0>1<>2<3>" regex of "[12]?" 0): "<>0123" 1): "<>0<1><2><>3<>" 2): "<>0<1><2>3<>" I am leaning to simply changing it from policy 0 to policy 1. Are there any objections? Perhaps I should set a deadline? Now where is that library policy... -- Chris

Ensuring compatibilty with the usage described in "Real World Haskell" would make me really really really happy :) haskell:
Before I go and add or change anything in the Haskell regex-posix library, I wanted to get some feedback. regex-posix provides Text.Regex.Posix, and is built on the regex-base package.
regex-posix is also used as the backend for Text.Regex (the regex-compat package does this). I do not intend to change the behavior of the old Text.Regex API.
The main issue is the behavior when returning a list of all matches of a Regex against a target text. I no longer think the current behavior is the right choice when it comes to zero-length matches.
The current behavior is to return non-overlapping matches with the caveat that after the first zero-length match the search is ended. Note that the zero-length match may be occur at the end position of a previous non-zero-length match.
Notably, no one has complained about this policy. But I no longer like it. So here are a few of my ideas of what to change it to:
0) No change, not worth the effort.
1) return the zero-length match, skip forward 1 character, and continue searching. If the consumer wishes the old policy they can truncate the list. This could also be filtered to resemble option 2 below.
2) Mimic "sed". It seems "sed" has a policy where a zero-length match is forbidden to occur at the end position of a non-zero-length match. "sed" does not stop with the first zero-length match.
3) implement additional execution options, so the user can choose a policy. The default policy choice left with the current behavior.
4) implement additional execution options, so the user can choose a policy. The default policy choice set to he behavior in (1).
5) Return valid matches starting from all positions, including overlapping matches. This I really do not like and one can run the search starting one character after the start of the last match to get this information.
Matching "0123" and replacing all matches with themselves wrapped in angle brackets. The policies of 0, 1, and 2 above lead to (computed partly by hand):
regex of "[0123]?" 0): "<0><1><2><3><>" 1): "<0><1><2><3><>" 2): "<0><1><2><3>"
regex of "[012]?" 0): "<0><1><2>3<>" 1): "<0><1><2>3<>" 2): "<0><1><2>3<>"
regex of "[013]?" 0): "<0><1><>23" 1): "<0><1><>2<3><>" 2): "<0><1>2<3>"
regex of "[023]?" 0): "<0><>123" 1): "<0><>1<2><3><>" 2): "<0>1<2><3>"
regex of "[123]?" 0): "<>0123" 1): "<>0<1><2><3><>" 2): "<>0<1><2><3>"
regex of "[03]?" 0): "<0><>123" 1): "<0><>1<>2<3><>" 2): "<0>1<>2<3>"
regex of "[03]?" 0): "<0><>123" 1): "<0><>1<>2<3><>" 2): "<0>1<>2<3>"
regex of "[12]?" 0): "<>0123" 1): "<>0<1><2><>3<>" 2): "<>0<1><2>3<>"
I am leaning to simply changing it from policy 0 to policy 1.
Are there any objections?
Perhaps I should set a deadline? Now where is that library policy...
-- Chris _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Chris Kuklewicz
Matching "0123" and replacing all matches with themselves wrapped in angle brackets. The policies of 0, 1, and 2 above lead to (computed partly by hand):
regex of "[0123]?" 0): "<0><1><2><3><>" 1): "<0><1><2><3><>" 2): "<0><1><2><3>"
regex of "[012]?" 0): "<0><1><2>3<>" 1): "<0><1><2>3<>" 2): "<0><1><2>3<>"
regex of "[013]?" 0): "<0><1><>23" 1): "<0><1><>2<3><>" 2): "<0><1>2<3>"
regex of "[023]?" 0): "<0><>123" 1): "<0><>1<2><3><>" 2): "<0>1<2><3>"
regex of "[123]?" 0): "<>0123" 1): "<>0<1><2><3><>" 2): "<>0<1><2><3>"
regex of "[03]?" 0): "<0><>123" 1): "<0><>1<>2<3><>" 2): "<0>1<>2<3>"
regex of "[03]?" 0): "<0><>123" 1): "<0><>1<>2<3><>" 2): "<0>1<>2<3>"
regex of "[12]?" 0): "<>0123" 1): "<>0<1><2><>3<>" 2): "<>0<1><2>3<>"
Policy 2 seems the most intuitive to me, I can't imagine a situation where you would want the empty match at the end of a non-empty match. On the other hand, I don't think I've ever used a regular expression that matched an empty string, so it's not particularly important to me.

mail@justinbogner.com wrote:
Policy 2 seems the most intuitive to me, I can't imagine a situation where you would want the empty match at the end of a non-empty match. On the other hand, I don't think I've ever used a regular expression that matched an empty string, so it's not particularly important to me.
The authors of sed are in agreement with your intuition. But I think policy 2 as a recursive definition it is unusual. And I see policy 2 as very hard to summarize. The single-match is always easy to summarize: the leftmost longest match. Policy 1 for multiple matches can be summarized as:
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters.
Policy 2 for multiple matches can be summarized as:
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters, excluding zero-length matches that coincide with the end of a non-zero-length match.
Policy 0 has been:
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters, stopping with the first zero-length match.
Policy 5 would be even simpler in this language:
The longest match at each position where a match occurs.
Policy 5 can be filtered to get policy 1. Policy 1 can be filtered to get policy 0 or to get policy 2. The best practices for a regexp search with multiple matches is to not use a regexp that can match zero-length bits of text (though it may be handy for finding newlines there are simpler ways). So let me also put for "Policy 6"
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters, excluding all zero-length matches.
Note that the only zero-length POSIX extended regular expression matches are () which always matches or ^ or $ or ^$ (which is the same as $^). But the policy I choose here will likely be applied to other regular expression backends that can do fancier testing. Thanks for the comments! Chris

Chris Kuklewicz
The authors of sed are in agreement with your intuition. But I think policy 2 as a recursive definition it is unusual. And I see policy 2 as very hard to summarize.
The single-match is always easy to summarize: the leftmost longest match.
Policy 1 for multiple matches can be summarized as:
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters.
Policy 2 for multiple matches can be summarized as:
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters, excluding zero-length matches that coincide with the end of a non-zero-length match.
Policy 0 has been:
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters, stopping with the first zero-length match.
Given this point, policy 1 does indeed seem the most consistent behaviour. Thanks for the explanation!
participants (3)
-
Chris Kuklewicz
-
Don Stewart
-
mail@justinbogner.com