ANN: weighted-regexp-0.1.0.0

older
data.binary get reading beyond end...

Sebastian Fischer

26 Jul 2010 26 Jul '10

3:23 p.m.

Hello, this year's ICFP features A Play on Regular Expressions where two Haskell programmers and an automata theory guru develop an efficient purely functional algorithm for matching regular expressions. A Haskell library based on their ideas is now available from Hackage. For more information (and a link to the play) visit: http://sebfisch.github.com/haskell-regexp/ Cheers, Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Show replies by date

Felipe Lessa

26 Jul 26 Jul

8:48 p.m.

Wow, great paper! I got somewhat scared when I saw the first description of the scene, but after I started reading I couldn't stop anymore =D. Thanks, -- Felipe.

Sjoerd Visscher

11:04 p.m.

Hi Sebastian, I enjoyed this paper very much. Writing papers in the style of a play seems to work very well! (although I think you should spice it up more if your want to get it on Broadway) It seems that only shift needs the reg field of the RegW datatype. So you can also replace the reg field with a shift field. This makes the regexp parser extensible, as there is no longer a dependence on the (closed) datatype Reg:

...

data RegW w c = RegW { active :: !Bool, empty :: !w, final_ :: !w, shift :: w -> c -> RegW w c }

For example it is then easy to define the parser that matches nothing, which is the identity element of alt:

...

noMatch :: RegExp c noMatch = RegExp noMatchW

noMatchW :: Semiring w => RegW w c noMatchW = RegW False zero zero $ \_ _ -> noMatchW

But otherwise I do wonder if the parser needs to be extensible. For example some XML Schema implementations that are based on finite automata have special cases for the xs:all construct, which matches a list of elements, each occurring once in any order. But I tried a straightforward implementation and it works fine:

...

eachOnce :: [RegExp c] -> RegExp c eachOnce [] = eps eachOnce ps = eachOnce' ps [] where eachOnce' [] _ = noMatch eachOnce' (p:ps) qs = (p `seq_` eachOnce (ps ++ qs)) `alt` eachOnce' ps (p:qs)

*Main> accept (eachOnce (map char ['a'..'z'])) $ reverse ['a'..'z'] True (0.05 secs, 8706356 bytes) greetings, Sjoerd

Sebastian Fischer

27 Jul 27 Jul

4:57 a.m.

Hi Sjoerd,

...

It seems that only shift needs the reg field of the RegW datatype. So you can also replace the reg field with a shift field. This makes the regexp parser extensible, as there is no longer a dependence on the (closed) datatype Reg:

...
data RegW w c = RegW { active :: !Bool, empty :: !w, final_ :: !w, shift :: w -> c -> RegW w c }

Interesting observation. However, such an encoding would prevent the definition of some other functions on RegExp. More specifically, there are Show and Eq instances for the QuickCheck tests.

...

For example it is then easy to define the parser that matches nothing, which is the identity element of alt:

...
noMatch :: RegExp c noMatch = RegExp noMatchW

noMatchW :: Semiring w => RegW w c noMatchW = RegW False zero zero $ \_ _ -> noMatchW

Note that you can also define it with the current interface: noMatch :: RegExp c noMatch = psym "(Big Lambda)" (const False) Maybe I'll add it to the next version. I only need a better string representation ;)

...

But otherwise I do wonder if the parser needs to be extensible.

I have some ideas for extending the matcher. For example /a{2,5}/ is currently translated into /aa(a(a(a)?)?)?/ but it may be possible to handle it without such blowup. I also want to add substring matching, i.e., the possibility to find out against which strings parenthesized parts of the regexp were matched. But as the closed Reg type is not exported I can freely change it along with any matcher extension.

...

For example some XML Schema implementations that are based on finite automata have special cases for the xs:all construct, which matches a list of elements, each occurring once in any order. But I tried a straightforward implementation and it works fine:

...
eachOnce :: [RegExp c] -> RegExp c eachOnce [] = eps eachOnce ps = eachOnce' ps [] where eachOnce' [] _ = noMatch eachOnce' (p:ps) qs = (p `seq_` eachOnce (ps ++ qs)) `alt` eachOnce' ps (p:qs)

Neat! That's also worth adding. I find eachOnce :: [RegExp c] -> RegExp c eachOnce = foldr alt noMatch . map (foldr seq_ eps) . permutations even clearer but your version is *much* better as it uses nesting to combine all alternatives that start with the same regexp. Thanks! Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Sebastian Fischer

5:09 a.m.

On Jul 27, 2010, at 6:57 AM, Sebastian Fischer wrote:

...

Maybe I'll add it [noMatch] to the next version. I only need a better string representation ;)

Ha! It's already provided by character classes: ghci> accept (fromString "[]") "abc" False I'll add noMatch :: RegExp c noMatch = psym "[]" (const False) Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Sjoerd Visscher

7:04 a.m.

On Jul 27, 2010, at 7:09 AM, Sebastian Fischer wrote:

...

I find

eachOnce :: [RegExp c] -> RegExp c eachOnce = foldr alt noMatch . map (foldr seq_ eps) . permutations

even clearer but your version is *much* better as it uses nesting to combine all alternatives that start with the same regexp.

Yes, this was what I had at first too, but trying to match this on 8 items takes 2 seconds and 9 items already takes one minute. -- Sjoerd Visscher http://w3future.com

Sjoerd Visscher

7:15 a.m.

On Jul 27, 2010, at 7:09 AM, Sebastian Fischer wrote:

...

I'll add

noMatch :: RegExp c noMatch = psym "[]" (const False)

Oh, by the way, with noMatch, eps, alt and seq_ RegExp is itself a Semiring, but I'm not sure what that would do. -- Sjoerd Visscher http://w3future.com

Eugene Kirpichov

7:30 a.m.

Perhaps this might mean that we can get incremental and parallel regexp matching by associating each character with a linear operator (matrix) over this or related semiring, or something, and mixing that with two sigfpe's articles: http://blog.sigfpe.com/2008/11/approach-to-algorithm-parallelisation.html http://blog.sigfpe.com/2009/01/fast-incremental-regular-expression.html 2010/7/27 Sjoerd Visscher :

...

On Jul 27, 2010, at 7:09 AM, Sebastian Fischer wrote:

...
I'll add

noMatch :: RegExp c noMatch = psym "[]" (const False)

Oh, by the way, with noMatch, eps, alt and seq_ RegExp is itself a Semiring, but I'm not sure what that would do. -- Sjoerd Visscher http://w3future.com

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Eugene Kirpichov Senior Software Engineer, Grid Dynamics http://www.griddynamics.com/

S. Doaitse Swierstra

28 Jul 28 Jul

11:14 a.m.

On 27 jul 2010, at 09:30, Eugene Kirpichov wrote:

...

Perhaps this might mean that we can get incremental and parallel regexp matching by associating each character with a linear operator

This is exactly what is happening in the uu-parsinglib. Doaitse

...

(matrix) over this or related semiring, or something, and mixing that with two sigfpe's articles: http://blog.sigfpe.com/2008/11/approach-to-algorithm-parallelisation.html http://blog.sigfpe.com/2009/01/fast-incremental-regular-expression.html

2010/7/27 Sjoerd Visscher :

...
On Jul 27, 2010, at 7:09 AM, Sebastian Fischer wrote:

...
I'll add

noMatch :: RegExp c noMatch = psym "[]" (const False)

Oh, by the way, with noMatch, eps, alt and seq_ RegExp is itself a Semiring, but I'm not sure what that would do. -- Sjoerd Visscher http://w3future.com

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Eugene Kirpichov Senior Software Engineer, Grid Dynamics http://www.griddynamics.com/ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Eugene Kirpichov

11:17 a.m.

This is very interesting! Could you provide some more info? T.i. where to look in the source, or on the web? 2010/7/28 S. Doaitse Swierstra :

...

On 27 jul 2010, at 09:30, Eugene Kirpichov wrote:

...
Perhaps this might mean that we can get incremental and parallel regexp matching by associating each character with a linear operator

This is exactly what is happening in the uu-parsinglib.

Doaitse

...
(matrix) over this or related semiring, or something, and mixing that with two sigfpe's articles: http://blog.sigfpe.com/2008/11/approach-to-algorithm-parallelisation.html http://blog.sigfpe.com/2009/01/fast-incremental-regular-expression.html

2010/7/27 Sjoerd Visscher :

...
On Jul 27, 2010, at 7:09 AM, Sebastian Fischer wrote:

...
I'll add

noMatch :: RegExp c noMatch = psym "[]" (const False)

Oh, by the way, with noMatch, eps, alt and seq_ RegExp is itself a Semiring, but I'm not sure what that would do. -- Sjoerd Visscher http://w3future.com

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Eugene Kirpichov Senior Software Engineer, Grid Dynamics http://www.griddynamics.com/ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Eugene Kirpichov Senior Software Engineer, Grid Dynamics http://www.griddynamics.com/

S. Doaitse Swierstra

12:53 p.m.

On 28 jul 2010, at 13:17, Eugene Kirpichov wrote:

...

This is very interesting! Could you provide some more info? T.i. where to look in the source, or on the web?

see: file:///Users/doaitse/.cabal/share/doc/uu-parsinglib-2.4.2/html/index.html The README.hs module contains some further references. Doaitse

...

2010/7/28 S. Doaitse Swierstra :

...
On 27 jul 2010, at 09:30, Eugene Kirpichov wrote:

...
Perhaps this might mean that we can get incremental and parallel regexp matching by associating each character with a linear operator

This is exactly what is happening in the uu-parsinglib.

Doaitse

...
(matrix) over this or related semiring, or something, and mixing that with two sigfpe's articles: http://blog.sigfpe.com/2008/11/approach-to-algorithm-parallelisation.html http://blog.sigfpe.com/2009/01/fast-incremental-regular-expression.html

2010/7/27 Sjoerd Visscher :

...
On Jul 27, 2010, at 7:09 AM, Sebastian Fischer wrote:

...
I'll add

noMatch :: RegExp c noMatch = psym "[]" (const False)

Oh, by the way, with noMatch, eps, alt and seq_ RegExp is itself a Semiring, but I'm not sure what that would do. -- Sjoerd Visscher http://w3future.com

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Eugene Kirpichov Senior Software Engineer, Grid Dynamics http://www.griddynamics.com/ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Eugene Kirpichov Senior Software Engineer, Grid Dynamics http://www.griddynamics.com/

Martijn van Steenbergen

3:20 p.m.

On 7/28/10 14:53, S. Doaitse Swierstra wrote:

...

see: file:///Users/doaitse/.cabal/share/doc/uu-parsinglib-2.4.2/html/index.html

Readers might have more luck with the following URLs: http://hackage.haskell.org/package/uu-parsinglib http://hackage.haskell.org/packages/archive/uu-parsinglib/2.4.2/doc/html/Tex... Groetjes, Martijn.

Sebastian Fischer

27 Jul 27 Jul

7:58 a.m.

On Jul 27, 2010, at 9:15 AM, Sjoerd Visscher wrote:

...

Oh, by the way, with noMatch, eps, alt and seq_ RegExp is itself a Semiring,

Yes, but it's hard to define an Eq instance for arbitrary regular expressions that reflects equivalence of regexps. There is currently only `instance Eq (RegExp Char)` which implements structural identity used for the QuickCheck tests.

...

but I'm not sure what that would do.

I think matching a regular expression against a word in the "regular expressions semiring" returns an unfolding of the original regular expression which matches exactly the given word. Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Martijn van Steenbergen

28 Jul 28 Jul

3:20 p.m.

On 7/27/10 9:58, Sebastian Fischer wrote:

...

On Jul 27, 2010, at 9:15 AM, Sjoerd Visscher wrote:

...
Oh, by the way, with noMatch, eps, alt and seq_ RegExp is itself a Semiring,

Yes, but it's hard to define an Eq instance for arbitrary regular expressions that reflects equivalence of regexps.

How hard is this exactly? Martijn.

Sebastian Fischer

3:41 p.m.

...

...
...
Oh, by the way, with noMatch, eps, alt and seq_ RegExp is itself a Semiring,

Yes, but it's hard to define an Eq instance for arbitrary regular expressions that reflects equivalence of regexps.

How hard is this exactly?

The standard algorithm to decide regexp equivalence transforms both expressions into DFAs and checks whether they are equivalent. DFAs can be checked for equivalence by forming their in-equivalence-product and checking whether the result DFA does not accept any input. See http://www.cs.rice.edu/~vardi/papers/sigcse06t1.pdf.gz This algorithm is quadratic in the sizes of the DFAs but they can themselves be exponentially large in the sizes of the regexps. I don't know whether there is more efficient way to decide regexp equivalence. Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Sebastian Fischer

27 Jul 27 Jul

7:50 a.m.

New subject: ANN: weighted-regexp-0.1.1.0

I have released weighted-regexp-0.1.1.0 with two additional combinators: -- | Does not match anything. 'noMatch' is an identity for 'alt'. -- noMatch :: RegExp c -- | -- Matches a sequence of the given regular expressions in any -- order. For example, the regular expression -- -- @ -- perm (map char \"abc\") -- @ -- -- has the same meaning as -- -- @ -- abc|acb|bcc|bac|cba|cab -- @ -- -- and is represented as -- -- @ -- a(bc|cb)|b(ca|ac)|c(ba|ab) -- @ -- perm :: [RegExp c] -> RegExp c Cheers, Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Chris Kuklewicz

28 Jul 28 Jul

11:33 a.m.

On 26/07/2010 16:23, Sebastian Fischer wrote:

...

this year's ICFP features A Play on Regular Expressions where two Haskell programmers and an automata theory guru develop an efficient purely functional algorithm for matching regular expressions.

That is wonderfully clean way to go straight to a DFA! Have you thought about supporting anchors like (^) and ($) ? It would mean the match would have to add some prev & next parameters (or a Reader monad?) to the shift call. In the Play you happen to compare with the native POSIX regex library on Mac OS X. This library happens to be buggy and sometimes fails to find the leftmost longest match! (email footnote [1]) You also wrote

...

I have some ideas for extending the matcher. For example /a{2,5}/ is currently translated into /aa(a(a(a)?)?)?/ but it may be possible to handle it without such blowup. I also want to add substring matching, i.e., the possibility to find out against which strings parenthesized parts of the regexp were matched.

Getting a good specification for the substring rules is non-trivial. When writing regex-tdfa (on hackage) I consulted Glenn Fowler's description at

...

http://www2.research.att.com/~gsf/testregex/re-interpretation.html

which are well-defined rules for POSIX substring capture. No implementation of POSIX that I have tested against fully follows the specification, though Fowler's AT&T code comes very very close (the bugs quickcheck found are also on http://www.haskell.org/haskellwiki/Regex_Posix ). In the spirit of your DFA I am guessing the "mark" type for substring capture would actually be an annotated copy of the whole regular expression tree. My regex-tdfa does not look like your DFA; my code followed Ville Laurikari's master's thesis ( http://laurikari.net/tre/ but his libtre is quite buggy) which uses an array of indexes as the "mark" on each NFA state. But regex-tdfa is quite old and over-complicated and while it is linear in complexity it is not especially fast. I am currently rewriting the algorithm in OCaml (to learn OCaml) and I am trying to avoid the /a{2,5}/ blowup that regex-tdfa has. Cheers, Dr. Chris Kuklewicz [1] At http://www.haskell.org/haskellwiki/Regex_Posix there is a discussion of the regex-posix-unittest (on hackage) results for OS X/FreeBSD. Some critical failures for leftmost longest that I (actually quickcheck) found is: ############################# Unexpected Fail # 1 Searched text: "ab" Regex pattern: "(()|.)(b)" Expected output: "(0,2)(0,1)(-1,-1)(1,2)" Actual result : "(1,2)(1,1)(1,1)(1,2)" ############################# Unexpected Fail # 2 Searched text: "ab" Regex pattern: "(()|[ab])(b)" Expected output: "(0,2)(0,1)(-1,-1)(1,2)" Actual result : "(1,2)(1,1)(1,1)(1,2)" ############################# Unexpected Fail # 3 Searched text: "aaab" Regex pattern: "(()|[ab])+b" Expected output: "(0,4)(2,3)(-1,-1)" Actual result : "(3,4)(3,3)(3,3)" The first (i,j) values are the indexes for the whole match (staring and one-past-the-end). Rational people can argue about the index values in the other substring captures, but the first one for the whole match is simply a bug.

Sebastian Fischer

12:39 p.m.

...

Have you thought about supporting anchors like (^) and ($) ?

We went the opposite route, made full matching the default, and implemented partial matching by pre- and appending .* As there are both a fullMatch and a partialMatch function, I don't see an immediate need for anchors, although I admit that they have the advantage that you can specify *in the regexp* whether you want full or partial matching.

...

You also wrote

...
I also want to add substring matching, i.e., the possibility to find out against which strings parenthesized parts of the regexp were matched.

Getting a good specification for the substring rules is non-trivial. When writing regex-tdfa (on hackage) I consulted Glenn Fowler's description at

...
http://www2.research.att.com/~gsf/testregex/re-interpretation.html

which are well-defined rules for POSIX substring capture.

Thanks for the pointer. I'll consult it when going for submatch extraction.

...

In the spirit of your DFA I am guessing the "mark" type for substring capture would actually be an annotated copy of the whole regular expression tree.

I am not sure whether it is a good idea to implement submatch capture only by means of a different semiring. I would not hesitate to extend the matching algorithm itself, if that leads to a clean implementation.

...

But regex-tdfa is quite old and over-complicated and while it is linear in complexity it is not especially fast. I am currently rewriting the algorithm in OCaml (to learn OCaml) and I am trying to avoid the /a{2,5}/ blowup that regex-tdfa has.

Did you experience problems with such blowup? The re2 library also implements bounded repetitions by blowing them up, so, at least, it doesn't seem to be a big problem for Google. Cheers, Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Chris Kuklewicz

1:33 p.m.

...

As there are both a fullMatch and a partialMatch function, I don't see an immediate need for anchors, although I admit that they have the advantage that you can specify *in the regexp* whether you want full or partial matching.

The REG_NEWLINE flag for compiling POSIX regular expressions is defined as:

...

REG_NEWLINE Compile for newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning in either REs or strings. With this flag, `[^' bracket expressions and `.' never match newline, a `^' anchor matches the null string after any newline in the string in addition to its normal function, and the `$' anchor matches the null string before any newline in the string in addition to its normal function.

Using ^ and $ to match near newlines makes matching patterns that cross lines much easier to control. But anchors are also a source of bugs for some implementations, e.g. libtre.

...

I am not sure whether it is a good idea to implement submatch capture only by means of a different semiring. I would not hesitate to extend the matching algorithm itself, if that leads to a clean implementation.

The "tag" idea that I copy from Ville's thesis is that each NFA state holds a history. This history is your expression tree with each node (i.e. sub-pattern) recording the most recent input index where the sub-pattern was entered and when it was exited. During matching two such histories will be "added" if they arrive at the same NFA state, where "adding" compares the sub-patterns in priority order to choose which of the two histories is kept and which is discarded. The problem is keeping the histories bounded and the answer correct. Making this work efficiently (linear in time, constant in space) and match POSIX substrings has been hard. Doing things like expanding /a{2,5}/ helps. The thing that makes constant space hard is that /a*/ cannot be expanded as the blow-up is unbounded. Expanding during execution is still linear in space: one needs to record the length of each iteration of the sub-pattern /a/ in an unbounded list. Ville's libtre cheats and gets the wrong answer by using only the length of the final iteration. With only the last length one can only maximize (or minimize) the last iteration of /a*/. The re-interpretation specification makes it clear that the length of each iteration from left to right ought to be greedy, constrained by the overall left-long match. Thus one _needs_ to use the length of _all_ iterations to be sure of getting the correct POSIX substring match for the last iteration. The original regex-tdfa took linear space and got the right answer. The current regex-tdfa does a clever thing to compress the list of lengths and thus runs constant space. The clever trick while matching is, schematically: 1) Ever 100 characters pause and perform a "compression", which is 2) Go though the NFA states and sort the lists of repetition lengths 3) Replace the lists of repetition lengths with a single integer which is the ordinal position found by the sorting 4) Resume matching, building up the lists again for the next 100 characters...goto step 1 Thus the lists are bounded at 100 entries at the cost of breaking the abstraction that the NFA "marks" only interact with each other when they collide. The compression step pre-sorts the "mark" histories to decide the result of a possible future collision. It is implemented as a time/space trade-off. A more elegant re-implementation might do away with the separate NFA "mark" abstraction and keep them always sorted...

...

I am currently rewriting the

...
algorithm in OCaml (to learn OCaml) and I am trying to avoid the /a{2,5}/ blowup that regex-tdfa has.

Did you experience problems with such blowup? The re2 library also implements bounded repetitions by blowing them up, so, at least, it doesn't seem to be a big problem for Google.

Blowup in regex-tdfa is a problem for large blowups. The constant bounded history space required for a length 'm' expression is O(m^2) because of the substring capturing information. And I use a lazy DFA so I am vulnerable to making a graph of size 2^m until the whole expression gets garbage collected. I am impressed by how your lazy Haskell code handles infinite 'm'! The OCaml re-implementation of regex-tdfa should fight this space blow-up by using an one-the-fly DFA like yours and by not expanding /a+/ /a{i,}/ and /a{i,k}/ expressions. But this is unfinished. Cheers, Dr. Chris Kuklewicz

Sebastian Fischer

8:38 p.m.

...

...
REG_NEWLINE Compile for newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning in either REs or strings. With this flag, `[^' bracket expressions and `.' never match newline, a `^' anchor matches the null string after any newline in the string in addition to its normal function, and the `$' anchor matches the null string before any newline in the string in addition to its normal function.

Using ^ and $ to match near newlines makes matching patterns that cross lines much easier to control. But anchors are also a source of bugs for some implementations, e.g. libtre.

Maybe I underestimated the utility of ^ and $. The definition seems intricate. I thought about adding a combinator for matching newline but now think that would lead to wrong start and end positions. For example the start position of the matching substring for ^a in "a \na" should be 2 not 1, right? Or is it 0 although there is no newline at the beginning? Is there a page with examples that show how ^ and $ should behave exactly? Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Chris Kuklewicz

10:07 p.m.

...

Maybe I underestimated the utility of ^ and $. The definition seems intricate. I thought about adding a combinator for matching newline but now think that would lead to wrong start and end positions. For example the start position of the matching substring for ^a in "a\na" should be 2 not 1, right? Or is it 0 although there is no newline at the beginning?

The first "a" would match with indexes (0,1) and the second "a" would match with indexes (1,2).

...

Is there a page with examples that show how ^ and $ should behave exactly?

Without REG_NEWLINE the meanings are: . matches any single character (though note that handling of a zero byte is impossible for C style strings for a different reason). ^ is an assertion that instead of being AlwaysTrue (eps)or AlwaysFalse (noMatch) is true before any characters have been accepted and false afterward. $ is an assertion that is true only when there are no more characters to match and false before this. With REG_NEWLINE the meanings are: . matches any single character EXCEPT '\n' newline (ASCII 10, I think). ^ is true before any characters have been matched and true right after a newline has been matched, else false. ^ is true when there are no more characters to match and true if the next character to match is a newline, else false. Let 'a' and 'b' and 'c' be some complicated regular expressions that cannot accept a newline with REG_NEWLINE enabled: ^$ finds blank lines, the indexes between newlines or between a newline and the start or end of the text. ^a$ requires 'a' to exactly fill a line and the captured string has no newlines. A more complicated use, perhaps as part of a crazy parser: "(a(\n)?)(^|b)(c|$)" has 'a' much some text and perhaps the newline. If the newline was there then the ^ matches and b might be skipped, otherwise b must be used. The match ends with '(c|$)' is thus either starting the new line or trailing b. And (c|$) can avoid matching 'c' if the next character is a newline. Note that the regular expression "(^|[aA])" has a non-trivial "can_accept_empty" property: it can sometimes accept empty. And if you are recording parenthetical captures then "(^)?" is subtle. When ^ is true the (^) succeeds like () and when it is false it does not. This inserts a test into the pattern that can be checked later. And "((^$)|(^)|($))" is worse: it does not always succeed and which sub-pattern gets captured depends on the presence of one or two newlines. In "((^)|(^$))" it is impossible for (^$) to be used since the first (^) will always be favored by the POSIX rules. Similarly "(()|(^))" will never use (^). A small chunk of regex-tdfa sifts through the possible ways to accept 0 characters for each node in the parse-tree and keeps an ordered list of sets of assertions to check, and cleans outs those that are logically excluded. Slightly more useful anchors are added in Perl/PCRE:

...

ANCHORS AND SIMPLE ASSERTIONS \b word boundary \B not a word boundary ^ start of subject also after internal newline in multiline mode \A start of subject $ end of subject also before newline at end of subject also before internal newline in multiline mode \Z end of subject also before newline at end of subject \z end of subject \G first matching position in subject

I added \b \B as above, and added \` \' to be like \A and \Z above, and added \< and \> to be beginning and end of word assertions. With enough assertions and negated assertions one could level up to using a binary decision diagram to express when a sub-pattern can accept 0 characters. Ville's libtre gets this wrong:

...

Searched text: "searchme" Regex pattern: "((s^)|(s)|(^)|($)|(^.))*" Expected output: "(0,1)(0,1)(-1,-1)(0,1)(-1,-1)(-1,-1)(-1,-1)" Actual result : "(0,1)(0,1)(-1,-1)(0,1)(1,1)(1,1)(-1,-1)" And sometimes very wrong: Searched text: "searchme" Regex pattern: "s(^|())e" Expected output: "(0,2)(1,1)(1,1)" Actual result : "NOMATCH"

Cheers, Dr. Chris Kuklewicz

Sebastian Fischer

29 Jul 29 Jul

10:18 a.m.

Hello Chris, thanks for the examples. They show that by adding anchors, the meaning of regular expressions is no longer compositional. For example (^|a) accepts the empty word only if no characters have been read yet. So (^|a) =~ "" does hold but a(^|a) =~ "a" does not hold although a =~ "a" does. I deem compositionality an important property to compare the current implementation with its specification. As, currently, I don't know how to incorporate anchors without making a mess, I refrain from adding them for the moment.

...

A more complicated use, perhaps as part of a crazy parser: "(a(\n)?)(^|b)(c|$)"

That's a nice example and I don't know how to express this without anchors. Cheers, Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Sebastian Fischer

28 Jul 28 Jul

2:21 p.m.

New subject: ANN: weighted-regexp-0.2.0.0

I have uploaded version 0.2.0.0 of the weighted-regexp package for weighted regular expression matching to Hackage. The changes are briefly 0.2.0.0 More general types for matching functions Renamed accept to acceptFull, added acceptPartial Strict numeric semiring SPECIALIZE pragmas prevent memory leak Fixed mistake in Criterion benchmarks 0.1.1.0 added noMatch added perm and detailed at: http://sebfisch.github.com/haskell-regexp/CHANGES.html Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Benedikt Huber

10:47 p.m.

Sebastian Fischer schrieb:

...

Hello,

this year's ICFP features A Play on Regular Expressions where two Haskell programmers and an automata theory guru develop an efficient purely functional algorithm for matching regular expressions.

A Haskell library based on their ideas is now available from Hackage. For more information (and a link to the play) visit:

http://sebfisch.github.com/haskell-regexp/ Hello,

thanks for the great paper and its unusual style, I enjoyed reading it a lot! Taking a quick look at the PyPy blog post on JIT code generation for regular expressions, I thought it would be fun to implement a generator using the excellent LLVM bindings for haskell. The attached prototype does not scale for larger regexp, is mostly untested and probably unoptimized, but is quite a bit faster than ruby and python's re, with the code generation core only spanning 25 lines. Here is a biased (because the LLVM stuff needs to use bytestrings) and completely unrepresentative comparison for the "even number of c's" regular expressions:

...

ruby -e 'print("accbccacbc" * 10000000)' > test.in

...

# Using weighted-regexp-0.2.0.0 and the RE from the paper time ./TestWeightedRegexp < test.in # using the RE from the paper 58.34 real 56.65 user 1.10 sys

...

time ruby -e 'gets =~ /\A(?:[ab]*c[ab]*c)*[ab]*\Z/' < test.in 7.42 real 6.23 user 1.03 sys

...

time ./TestRegExpLLVM < test.in # using the RE from the paper 1.37 real 1.14 user 0.15 sys

For large regular expressions, the generated bitcode serves as a good stress test for LLVM's backend ;) Anyway, really entertaining stuff, thank you. cheers, benedikt -- Demo: LLVM Regexp matcher; (c) 2010, Benedikt Huber import Prelude hiding (and,or) import Control.Monad import Control.Monad.State import qualified Data.ByteString as BS import Data.ByteString.Internal (toForeignPtr) import Data.Char (ord) import Data.Word import Data.Int import Foreign.Ptr (plusPtr) import Foreign.ForeignPtr (withForeignPtr) import Foreign.Marshal.Array import LLVM.Core import LLVM.ExecutionEngine import LLVM.Util.Loop import LLVM.Util.File import System.IO.Unsafe data Reg c = Eps | Sym c | Opt (Reg c) | Alt (Reg c) (Reg c) | Seq (Reg c) (Reg c) | Rep (Reg c) empty :: Reg c -> Bool empty r = case r of Sym _ -> False Alt r1 r2 -> empty r1 || empty r2 Seq r1 r2 -> empty r1 && empty r2 _ -> True count :: Reg c -> Int count re = case re of Sym _ -> 1 Alt r1 r2 -> count r1 + count r2 Seq r1 r2 -> count r1 + count r2 Opt r -> count r Eps -> 0 Rep r -> count r number :: Reg c -> Reg (c, Int) number = (evalState `flip` 0) . numberM where ticket = State $ \s -> (s,s+1) numberM re = case re of Sym c -> liftM (Sym . (,) c) ticket Alt r1 r2 -> liftM2 Alt (numberM r1) (numberM r2) Seq r1 r2 -> liftM2 Seq (numberM r1) (numberM r2) Opt r -> liftM Opt (numberM r) Eps -> return Eps Rep r -> liftM Rep (numberM r) finalStates :: Reg (c,Int) -> [Int] finalStates re = case re of Sym (_,n) -> [n] Alt r1 r2 -> finalStates r1 ++ finalStates r2 Seq r1 r2 -> [ v | v <- finalStates r1, empty r2 ] ++ finalStates r2 Opt r -> finalStates r Eps -> [] Rep r -> finalStates r runOnString :: (Ptr Word8 -> Word32 -> IO Word32) -> BS.ByteString -> IO Bool runOnString f bs = do let (fptr,offset,len) = toForeignPtr bs r <- withForeignPtr fptr $ \ptr -> f (ptr `plusPtr` offset) (fromIntegral len) return (r > 0) regexMatcher :: Reg Char -> CodeGenModule (Function (Ptr Word8 -> Word32 -> IO Word32)) regexMatcher re = do createNamedFunction ExternalLinkage "matcher" $ \string len -> do let arrSize = fromIntegral (count re) :: Word32 arr <- arrayAlloca arrSize :: CodeGenFunction r (Value (Ptr Word32)) -- allocate arr forLoop (w32 0) (w32 arrSize) () $ \ix _ -> do -- memset all elements in arr to 0 aptr <- getElementPtr arr (ix, ()) store (valueOf 0) aptr top <- getCurrentBasicBlock -- loop initialization stuff loop <- newBasicBlock body <- newBasicBlock exit <- newBasicBlock br loop defineBasicBlock loop first <- phi [(valueOf True, top)] -- initially, first is True final <- phi [(valueOf (if empty re then True else False), top)] -- initially, top is True if re accepts eps strIx <- phi [(w32 0, top)] -- i = 0 t <- icmp IntEQ strIx len -- exit if i==len condBr t exit body defineBasicBlock body -- Define the loop body strp <- getElementPtr string (strIx, ()) -- get character ch <- load (strp :: Value (Ptr (Word8))) -- ch = str[i] generateRegexpCode re first arr ch -- generate regexp matcher code final' <- genFinalStateCheck (finalStates $ number re) arr (valueOf False) -- check whether we are in final state strIx_next <- add strIx (w32 1) -- add 1 to string index addPhiInputs strIx [(strIx_next, body)] addPhiInputs first [(valueOf False, body)] -- first = false from second iteration on addPhiInputs final [(final', body)] br loop -- and loop defineBasicBlock exit final_w32 <- zext final -- LLVM bindings currently do not have Generic Bool ret $ (final_w32 :: Value Word32) where w32 v = valueOf v :: Value Word32 generateRegexpCode :: Reg Char -> Value Bool -> Value (Ptr Word32) -> Value Word8 -> CodeGenFunction r (Value Bool) generateRegexpCode re first bitmask ch = genC first (number re) where genC :: Value Bool -> Reg (Char,Int) -> CodeGenFunction r (Value Bool) genC next nre = case nre of Sym (c,n) -> do tmp1 <- icmp IntEQ ch (valueOf (fromIntegral (ord c) :: Word8)) tmp2 <- and next tmp1 let nIx = fromIntegral n :: Word32 bp <- getElementPtr bitmask (nIx, ()) r <- load bp >>= trunc tmp3 <- zext tmp2 :: CodeGenFunction r (Value Word32) store tmp3 bp return r Seq r1 r2 -> do next1 <- genC next r1 next2 <- genC next1 r2 next2 `or` (if (empty r2) then next1 else valueOf False) Alt r1 r2 -> do next1 <- genC next r1 next2 <- genC next r2 tmp <- next1 `or` next2 tmp `or` (if (empty r1 || empty r2) then next else valueOf False) Rep r -> do next' <- genFinalStateCheck (finalStates r) bitmask next tmp <- genC next' r tmp `or` next Opt r -> do next1 <- genC next r next1 `or` next Eps -> do return next genFinalStateCheck :: [Int] -> Value (Ptr Word32) -> Value Bool -> CodeGenFunction r (Value Bool) genFinalStateCheck [] _ b = return b genFinalStateCheck (n:ns) arr b = do aptr <- getElementPtr arr (fromIntegral n :: Word32, ()) finalState <- load aptr >>= trunc tmp <- b `or` (finalState :: Value Bool) genFinalStateCheck ns arr tmp evencs :: Reg Char evencs = Seq (Rep (Seq onec onec)) nocs where nocs = Rep (Alt (Sym 'a') (Sym 'b')) onec = Seq nocs (Sym 'c') main :: IO () main = do initializeNativeTarget matches <- liftM ((unsafePerformIO.) . runOnString) (simpleFunction (regexMatcher evencs)) input <- BS.getContents forM_ (BS.split (fromIntegral (ord '\n')) input) $ \line -> do putStrLn (if matches line then "Match" else "No Match")

Sebastian Fischer

29 Jul 29 Jul

7:16 a.m.

On Jul 29, 2010, at 12:47 AM, Benedikt Huber wrote:

...

Taking a quick look at the PyPy blog post on JIT code generation for regular expressions, I thought it would be fun to implement a generator using the excellent LLVM bindings for haskell.

Interesting. Would you mind elaborating on your code? Maybe even write a blog about how it works? Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Sebastian Fischer

7:22 a.m.

...

Maybe even write a blog about how it works?

A whole blog is probably unnecessary ;) But a blog *post* would be nice! -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Benedikt Huber

30 Jul 30 Jul

1:07 a.m.

Sebastian Fischer schrieb:

...

On Jul 29, 2010, at 12:47 AM, Benedikt Huber wrote:

...
Taking a quick look at the PyPy blog post on JIT code generation for regular expressions, I thought it would be fun to implement a generator using the excellent LLVM bindings for haskell.

Interesting. Would you mind elaborating on your code? Maybe even write a blog about how it works?

Hi, so the current implementation (a little bit cleaned up at http://github.com/visq/llvm-regexp ), works as follows: It generates an LLVM function which matches a bytestring against a given regular expression. The state of the 'automaton' consists of one bit for each leaf of the regexp AST, corresponding to the marks in the article's figure. It then generates straight-line code updating the state given an input character (generateRegexpCode and matchCharSet in Text.RegExp.LLVM). For example, for matching '.a' , the generated code looks like this in a simplified pseudo code notation:

...

... next ~ first character matched ... ch ~ input character next1 = bitmask[0] -- was '.' marked ? bitmask[0] = next -- mark '.' if initial next2 = bitmask[1] -- was 'a' marked ? bitmask[1] = ch == 'a' && next1 -- mark 'a' if '.' was marked

-- and input is 'a' At the end of the string, code is generated to check whether the automaton is in a final state (genFinalStateCheck). In the example above, this corresponds to

...

final = bitmask[1]

The rest is either LLVM boilerplate (regexMatcher) or adaptions from weighed-regexp. Additionally, I've adapted the Parser.y from weighted-regexp, but some things (e.g. character classes like \w) are not implemented. Generating one big basic block for the whole regular expressions does not work well when there are more than a few thousand nodes in the AST. Using functions for large regular expressions and loop for repititions would be one solution. The other problem is that the matcher only operates on Word8s at the moment. Trying it out (if you have llvm+haskell bindings) is also easy (do not cabal-easy):

...

$ git clone git@github.com:visq/llvm-regexp.git && cd llvm-regexp $ make $ ./Grep '.*spotlight.*' < /etc/passwd < Line 45 matches

cheers, benedikt

5451

Age (days ago)

5455

Last active (days ago)

List overview

Download

26 comments

8 participants

participants (8)

Benedikt Huber
Chris Kuklewicz
Eugene Kirpichov
Felipe Lessa
Martijn van Steenbergen
S. Doaitse Swierstra
Sebastian Fischer
Sjoerd Visscher