
mail@justinbogner.com wrote:
Policy 2 seems the most intuitive to me, I can't imagine a situation where you would want the empty match at the end of a non-empty match. On the other hand, I don't think I've ever used a regular expression that matched an empty string, so it's not particularly important to me.
The authors of sed are in agreement with your intuition. But I think policy 2 as a recursive definition it is unusual. And I see policy 2 as very hard to summarize. The single-match is always easy to summarize: the leftmost longest match. Policy 1 for multiple matches can be summarized as:
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters.
Policy 2 for multiple matches can be summarized as:
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters, excluding zero-length matches that coincide with the end of a non-zero-length match.
Policy 0 has been:
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters, stopping with the first zero-length match.
Policy 5 would be even simpler in this language:
The longest match at each position where a match occurs.
Policy 5 can be filtered to get policy 1. Policy 1 can be filtered to get policy 0 or to get policy 2. The best practices for a regexp search with multiple matches is to not use a regexp that can match zero-length bits of text (though it may be handy for finding newlines there are simpler ways). So let me also put for "Policy 6"
All leftmost longest non-overlapping matches, where overlapping means sharing the same matching characters, excluding all zero-length matches.
Note that the only zero-length POSIX extended regular expression matches are () which always matches or ^ or $ or ^$ (which is the same as $^). But the policy I choose here will likely be applied to other regular expression backends that can do fancier testing. Thanks for the comments! Chris