regex-applicative library needs your help! (algorithmic challenge)

Please help make the regex-based parsing library efficient!
https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help

-- Roman I. Cheplyaka :: http://ro-che.info/

I wrote regex-tdfa, the efficient (though not yet lightning fast) Posix-like engine. You are not the first to want an efficient Perl-like engine. The answer you seek flows from Russ Cox (though in C++):
http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-reg...
Quoting relevant bit:
It also finds the leftmost-first match, the same match that Perl would, and can return submatch information. The one significant exception is that RE2 drops support for backreferences¹ and generalized zero-width assertions, because they cannot be implemented efficiently.
On 13/09/2011 06:40, Roman Cheplyaka wrote:
Please help make the regex-based parsing library efficient!
https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help

* Chris Kuklewicz
I wrote regex-tdfa, the efficient (though not yet lightning fast) Posix-like engine.
You are not the first to want an efficient Perl-like engine. The answer you seek flows from Russ Cox (though in C++):
http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-reg...
Quoting relevant bit:
It also finds the leftmost-first match, the same match that Perl would, and can return submatch information. The one significant exception is that RE2 drops support for backreferences¹ and generalized zero-width assertions, because they cannot be implemented efficiently.
Hi Chris, thanks for the response.

There's one thing about Cox's work that I don't understand. On the page [1] he mentions that the algorithm falls back to direct NFA simulation (search for "If all else fails" on that page). This defeats his goal of defending against DoS attacks (the second paragraph of that page).

And I wouldn't be comfortable with an algorithm that is worst-case exponential either.

Then there's another issue: I specifically want a combinator library, and not every automaton-based algorithm can serve this purpose in a statically typed language (unless there's some clever trick I don't know).

[1]: http://swtch.com/~rsc/regexp/regexp3.html

-- Roman I. Cheplyaka :: http://ro-che.info/

* Roman Cheplyaka
Then there's another issue: I specifically want a combinator library, and not every automaton-based algorithm can serve this purpose in a statically typed language (unless there's some clever trick I don't know).
Just to clarify: by "clever" trick I don't mean a "dirty" trick, like using unsafeCoerce in frisby[1]. I guess I'm just too young and naive to use that kind of stuff.

The Glushkov automaton is nice because it can be used to implement parsing combinators in pure Haskell + GADTs.

[1]: http://hackage.haskell.org/packages/archive/frisby/0.1/doc/html/src/Text-Par...

-- Roman I. Cheplyaka :: http://ro-che.info/

Hi,
I don't see how fallback to NFA simulation is really a failure wrt DoS
attacks. It's not exponential in time or memory, just linear memory
(in size of regex) instead of constant, and slower than DFA.
On Wed, Sep 14, 2011 at 1:29 AM, Roman Cheplyaka
* Chris Kuklewicz
[2011-09-13 08:21:55+0100] I wrote regex-tdfa, the efficient (though not yet lightning fast) Posix-like engine.
You are not the first to want an efficient Perl-like engine. The answer you seek flows from Russ Cox (though in C++):
http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-reg...
Quoting relevant bit:
It also finds the leftmost-first match, the same match that Perl would, and can return submatch information. The one significant exception is that RE2 drops support for backreferences¹ and generalized zero-width assertions, because they cannot be implemented efficiently.
Hi Chris, thanks for the response.
There's one thing about Cox's work that I don't understand. On the page [1] he mentions that the algorithm falls back to direct NFA simulation (search for "If all else fails" on that page). This defeats his goal of defending against DoS attacks (the second paragraph of that page).
And I wouldn't be comfortable with an algorithm that is worst-case exponential either.
Then there's another issue: I specifically want a combinator library, and not every automaton-based algorithm can serve this purpose in a statically typed language (unless there's some clever trick I don't know).
[1]: http://swtch.com/~rsc/regexp/regexp3.html
-- Roman I. Cheplyaka :: http://ro-che.info/
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Eugene Kirpichov Principal Engineer, Mirantis Inc. http://www.mirantis.com/ Editor, http://fprog.ru/

* Eugene Kirpichov
Hi, I don't see how fallback to NFA simulation is really a failure wrt DoS attacks. It's not exponential in time or memory, just linear memory (in size of regex) instead of constant, and slower than DFA.
Hi Eugene, thanks for pointing that out.

Indeed, I now see that he uses breadth-first rather than depth-first search. Then he has the same problem as I do. I shall study his code more closely.

-- Roman I. Cheplyaka :: http://ro-che.info/

If you are worried about Denial of Service attacks, then you are worried about (1) memory explosion and (2) long runtime. Long runtime can and should be solved by running a timeout connected to a thread killer, not at the level of the regular expression engine. The memory explosion is more interesting, and here the engine's tradeoffs have something to say. Since I know simple POSIX regular expression matching better, here is an overview of this without subexpression capture:

(1) The full DFA may be exponentially large and thus cannot be built ahead of time.
(2) Building the DFA in a lazy fashion will cause memory to grow linearly, which will be a problem if left to run too long.
(2.1) Pure lazy DFA building won't have a way to determine the currently computed size of the DFA, and is not going to help you.
(2.2) Cheating the "pure" with unsafePerformIO might be able to keep a count of the number of computed DFA nodes. Good luck with thread safety.
(2.3) Making it a stateful expanding DFA would allow you to control and monitor the growing number of DFA states.
(2.4) Adjust the stateful DFA to keep a limited cache of computed DFA nodes.
(3) Building an NFA from your regular expression can be cheap, but you would have to be more clever than expanding "a(b{2,99}c){100}d" to have 100 copies of 99 copies of b.
(4) Using an NFA with multiple active states (breadth-first) is like the limit of a DFA with only a single computed node at a time. The number of states active in the pure NFA can be easily counted and bounded to keep the system from using too much memory. Without the lazy or stateful caching of DFA states this may be slightly slower, depending on whether the cached states would be re-used heavily or not.

If you want POSIX subexpression capture, then your memory requirements go up by a factor that is, in the worst case, proportional to the regular expression size. This worst-case factor for "a(b{2,99}c){100}d" does have to track 100 copies of 99 copies of the tracking information. (The size of the NFA can still be clever when adding capture, but the tracked information you now need cannot be so clever.)

The Russ Cox engine in re2 uses NFA/DFA techniques to achieve Perl semantics, where it can check the possibilities in parallel. The same problems with exploding DFA size and tracked information may still apply.

On 14/09/2011 07:32, Roman Cheplyaka wrote:
* Eugene Kirpichov
[2011-09-14 08:38:10+0400] Hi, I don't see how fallback to NFA simulation is really a failure wrt DoS attacks. It's not exponential in time or memory, just linear memory (in size of regex) instead of constant, and slower than DFA.
Hi Eugene, thanks for pointing that out.
Indeed, I now see that he uses breadth-first rather than depth-first search. Then he has the same problem as I do. I shall study his code more closely.
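Chris's point (4) above, breadth-first NFA simulation with a counted and bounded set of active states, can be sketched in a few lines of Haskell. The `NFA` record and its `step` function below are illustrative inventions for this sketch, not code from regex-tdfa or any other library mentioned in the thread:

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- A minimal NFA: Int states, a nondeterministic transition function.
data NFA = NFA
  { start  :: Int
  , accept :: Set Int
  , step   :: Int -> Char -> [Int]
  }

-- Breadth-first simulation: the active set can never exceed the
-- number of NFA states, so memory is bounded by the regex size
-- (and Set.size states is trivial to monitor), not the input length.
runNFA :: NFA -> String -> Bool
runNFA nfa = go (Set.singleton (start nfa))
  where
    go states []       = not (Set.null (states `Set.intersection` accept nfa))
    go states (c : cs) =
      go (Set.fromList [ q | p <- Set.toList states, q <- step nfa p c ]) cs
```

Each input character costs at most (number of active states) x (transitions per state), giving the linear-in-input, regex-bounded-memory behaviour Eugene describes.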

Chris, thank you for an interesting overview. However, I'm not worried directly about DoS. I just want to build a regex library which would be convenient to use for parsing regular languages (by providing an applicative interface and Perl semantics), and not worse than the alternatives performance-wise (including worst-case time and space complexity).
(3) Building an NFA from your regular expression can be cheap, but you would have be more clever than expanding "a(b{2,99}c){100}d" to have 100 copies of 99 copies of b.
Are there any known algorithms more clever than that?

-- Roman I. Cheplyaka :: http://ro-che.info/

So you want to encode priorities efficiently as far as I understand
from [1]? Could bit-packing combined with prefix elimination do the
trick? Choice boils down to binary choice. Attach a number N to every
execution thread that sits in a given NFA state. When the thread moves
through a binary choice state, it splits into two threads with
N_{left} = 2 * N + 1 and N_{right} = 2 * N. When two threads arrive at
the same NFA state, the machine picks one with greater N and discards
the other, thus giving priority to early over late and left over
right. This makes these numbers grow quickly, but at any point in the
simulation one can identify the common binary prefix of all the thread
numbers and remove it, because it will not be relevant for comparison.
If this is too costly, amortize and do it every K steps instead of on
every step.
Anton
[1] https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
On Tue, Sep 13, 2011 at 1:40 AM, Roman Cheplyaka
Please help make the regex-based parsing library efficient!
https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
-- Roman I. Cheplyaka :: http://ro-che.info/
-- Kind Regards, Anton Tayanovskyy WebSharper™ - type-safe JavaScript in F# http://intellifactory.com
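Anton's numbering scheme can be sketched directly. The names `split`, `merge`, and `trim` are made up for this illustration; `trim` also simplifies his prefix elimination by subtracting the common minimum instead of stripping a shared binary prefix, which likewise preserves every comparison between thread numbers:

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

type State    = Int
type Priority = Integer  -- grows as threads pass through choices; see trim

-- Passing through a binary choice splits one thread into two; the
-- left branch gets the larger number, so it wins later ties.
split :: Priority -> (Priority, Priority)
split n = (2 * n + 1, 2 * n)

-- Threads colliding on the same NFA state: keep the higher priority,
-- preferring early over late and left over right.
merge :: [(State, Priority)] -> Map State Priority
merge = Map.fromListWith max

-- Keep the numbers small without changing any comparison (run this
-- every K simulation steps, as Anton suggests, to amortize the cost).
trim :: Map State Priority -> Map State Priority
trim m
  | Map.null m = m
  | otherwise  = let lo = minimum (Map.elems m)
                 in  Map.map (subtract lo) m
```

As Roman notes later in the thread, trimming alone does not give a hard memory bound when surviving threads diverge early, e.g. for /(a*)|((aa)*)/.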

By the way, can Haskell have a type that admits regular expressions and
only those? I mostly do ML these days, so trying to write up a regex
type in Haskell I was unpleasantly surprised to discover that there
are all sorts of exotic terms inhabiting it, which I did not have to
worry about in ML. Besides `undefined`, you can have terms that try
to define a grammar with nested parentheses. Using regex-applicative
syntax:
expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"
Is this the so-called 'paucity of types' [1]? What do you do, as a
library designer? Return bottom or try to detect and error out for
these kinds of values?
Thanks,
A
[1] http://existentialtype.wordpress.com/tag/recursion/
On Sat, Sep 17, 2011 at 9:46 PM, Anton Tayanovskyy
So you want to encode priorities efficiently as far as I understand from [1]? Could bit-packing combined with prefix elimination do the trick? Choice boils down to binary choice. Attach a number N to every execution thread that sits in a given NFA state. When the thread moves through a binary choice state, it splits into two threads with N_{left} = 2 * N + 1 and N_{right} = 2 * N. When two threads arrive at the same NFA state, the machine picks one with greater N and discards the other, thus giving priority to early over late and left over right. This makes these numbers grow quickly, but at any point in the simulation one can identify the common binary prefix of all the thread numbers and remove it, because it will not be relevant for comparison. If this is too costly, amortize and do it every K steps instead of on every step.
Anton
[1] https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
On Tue, Sep 13, 2011 at 1:40 AM, Roman Cheplyaka
wrote: Please help make the regex-based parsing library efficient!
https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
-- Roman I. Cheplyaka :: http://ro-che.info/
-- Kind Regards, Anton Tayanovskyy
WebSharper™ - type-safe JavaScript in F# http://intellifactory.com
-- Kind Regards, Anton Tayanovskyy WebSharper™ - type-safe JavaScript in F# http://intellifactory.com

On Sat, 2011-09-17 at 22:11 -0400, Anton Tayanovskyy wrote:
By the way, can Haskell have a type that admits regular expression and only those? I mostly do ML these days, so trying to write up a regex types in Haskell I was unpleasantly surprised to discover that there are all sorts of exotic terms inhabiting it, which I did not have to worry about in ML. Besides `undefined` you can have for terms that try to define a grammar with nested parentheses.
I'm not sure that I understand exactly what the question is... but if you want to recover the set of values in ML types, you need only add a '!' before each of the fields of each of your constructors, marking them as strict fields. Sadly (but unsurprisingly), this does require using a different version of lists and other fundamental types as well.

You'll be left with bottom, of course, but in strict-language thinking, that's nothing but an acknowledgement that a function returning that type may diverge... something you can't exclude in Haskell *or* ML; it's just that ML has a different natural way of thinking about it. That way lies Agda and Coq. So you get the next-closest thing: a so-called flat type, whose values are bottom and a single layer above it in the information ordering.

If I've misunderstood the question, my apologies... I haven't actually been reading this thread.

-- Chris Smith

* Anton Tayanovskyy
By the way, can Haskell have a type that admits regular expression and only those? I mostly do ML these days, so trying to write up a regex types in Haskell I was unpleasantly surprised to discover that there are all sorts of exotic terms inhabiting it, which I did not have to worry about in ML. Besides `undefined` you can have for terms that try to define a grammar with nested parentheses. Using regex-applicative syntax:
expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"
Is this the so-called 'paucity of types' [1]? What do you do, as a library designer? Return bottom or try to detect and error out for these kind of values?
Thanks,
A
Essentially, this is an "infinite" regular expression. I think it'd be possible to enforce finiteness on the type level. (There was a thread here some time ago about a type of finite lists.)

If a library can work with such infinite trees, and the user is happy with the resulting performance, then why not. For example, this is the case with the weighted-regexp library, and also with the first (and buggy) version of regex-applicative [2]. Otherwise, just let the user know that this function doesn't work with infinite data structures.

Note that in the presence of referential transparency (i.e. without StableNames or similar tools) it is not possible to decide at runtime whether a value is circular or not.

[2] https://github.com/feuerbach/regex-applicative/commit/95ddfbbfb01c47e292dc7f...

-- Roman I. Cheplyaka :: http://ro-che.info/
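A minimal illustration of the circular value under discussion, using an invented toy AST rather than regex-applicative's actual types (`Sym 'x'` stands in for the elided `...` alternative in Anton's example):

```haskell
-- Toy regex AST, for illustration only.
data RE = Eps | Sym Char | Alt RE RE | Seq RE RE

-- The "expr" from the message, tied into a knot: a perfectly
-- well-typed value...
expr :: RE
expr = Alt (Sym 'x') (Seq (Sym '(') (Seq expr (Sym ')')))

-- ...but it is circular, so any structurally recursive function
-- (size, pretty-printing, NFA construction) diverges on it, while
-- behaving normally on finite values.
size :: RE -> Int
size Eps       = 1
size (Sym _)   = 1
size (Alt a b) = 1 + size a + size b
size (Seq a b) = 1 + size a + size b
```

And, as Roman says, without StableNames a pure function like `size` cannot detect that `expr` is circular; it just fails to terminate.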

On Sat, Sep 17, 2011 at 22:11, Anton Tayanovskyy < anton.tayanovskyy@gmail.com> wrote:
By the way, can Haskell have a type that admits regular expression and only those? I mostly do ML these days, so trying to write up a regex types in Haskell I was unpleasantly surprised to discover that there are all sorts of exotic terms inhabiting it, which I did not have to worry about in ML. Besides `undefined` you can have for terms that try to define a grammar with nested parentheses. Using regex-applicative syntax:
expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"
The general case for this is solving the Halting Problem, so neither Haskell nor any other language can do it. You can disallow infinite values by encoding the length into the type, but (a) in Haskell this is painful and (b) runtime values now require runtime types, which you can accommodate but at the price of reintroducing the problems you are trying to prevent. (Dependently typed languages work better for this, but I suspect the result is rather more draconian than you intend.)

-- brandon s allbery allbery.b@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms

Chris, Brandon, thank you for the input. I believe I understand what
you are saying; to reiterate: yes, in the *general case*, neither ML
nor Haskell types rule out nastiness such as non-termination. Yes, I
know about and use Coq a bit. However, ML has an important *special
case*: not using functions. Chris, bang patterns in Haskell do
not quite help to statically rule out recursive terms, do they? They
just make for a different runtime error. If I said:
data Regex =
...
| Seq !Regex !Regex
Will it actually give the user who tries to construct a recursive
value a compile-time error? I suspect not, I'll try it myself to make
sure.
Roman, thanks, I will try to look that thread up - this is what I am
indeed curious about - how to enforce this at the type level.
Yes, with a vanilla interface sharing is not observable; some
libraries work with it and some do not. I am spooked out by this
though: performance guarantees are very different for regular terms
and terms with sharing. Consider the regex engine. The "infinite"
regular expressions (well, they are not really regular), may suddenly
produce an infinte NFA, so memory use may become linear in the input
string. Yes, a smart user will understand this, especially in an area
that is relatively well-understood such as regular expressions.
However I find it extremely disturbing that library designers do not
communicate requirements through the type system, and make performance
guarantees for all well-typed programs. ML makes it easy, I guess I am
just an ML convert :)
Thanks,
Anton
On Sun, Sep 18, 2011 at 11:28 AM, Brandon Allbery
On Sat, Sep 17, 2011 at 22:11, Anton Tayanovskyy
wrote: By the way, can Haskell have a type that admits regular expression and only those? I mostly do ML these days, so trying to write up a regex types in Haskell I was unpleasantly surprised to discover that there are all sorts of exotic terms inhabiting it, which I did not have to worry about in ML. Besides `undefined` you can have for terms that try to define a grammar with nested parentheses. Using regex-applicative syntax:
expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"
The general case for this is solving the Halting Problem, so neither Haskell nor any other language can do it. You can disallow infinite values by encoding the length into the type, but (a) in Haskell this is painful and (b) runtime values now require runtime types, which you can accommodate but at the price of reintroducing the problems you are trying to prevent. (Dependently typed languages work better for this, but I suspect the result is rather more draconian than you intend.) -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms
-- Kind Regards, Anton Tayanovskyy WebSharper™ - type-safe JavaScript in F# http://intellifactory.com

On Sep 18, 2011, at 11:28 AM, Brandon Allbery
On Sat, Sep 17, 2011 at 22:11, Anton Tayanovskyy
wrote: By the way, can Haskell have a type that admits regular expression and only those? I mostly do ML these days, so trying to write up a regex types in Haskell I was unpleasantly surprised to discover that there are all sorts of exotic terms inhabiting it, which I did not have to worry about in ML. Besides `undefined` you can have for terms that try to define a grammar with nested parentheses. Using regex-applicative syntax: expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"
The general case for this is solving the Halting Problem, so neither Haskell nor any other language can do it. You can disallow infinite values by encoding the length into the type, but (a) in Haskell this is painful and (b) runtime values now require runtime types, which you can accommodate but at the price of reintroducing the problems you are trying to prevent. (Dependently typed languages work better for this, but I suspect the result is rather more draconian than you intend.)
You needn't solve the halting problem. Any reasonable total language can prevent this case, and there is no need to keep types as simple as these around at runtime. See the Agda total parser combinator library for an example; but yes, dependent types are overkill.

I usually enforce constraints like this with ! patterns in the constructors, which lets me enforce the fact that at least I know that any attempt to define a cycle like this will bottom out, so I can safely think only inductive thoughts from there on out.

Another way is to just define a Template Haskell regex quasiquoter and enforce the limitation there. That of course costs you runtime generation.

-Edward Kmett

On Wed, Sep 21, 2011 at 20:05, Edward Kmett
I usually enforce constraints like this with ! patterns in the constructors, which lets me enforce the fact that at least I know that any attempt to define a cycle like this will bottom out, so I can safely think only inductive thoughts from there out.
But strictness annotations act at runtime (assignment time); he wants *compile* time, which puts him into more interesting type territory.

-- brandon s allbery allbery.b@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms
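A sketch of the experiment Anton proposes to run, which bears out Brandon's point: strict fields are accepted by the type checker even on a knot-tied value, and the failure only appears when the value is forced at runtime (the `Regex` type here is a minimal stand-in, not any library's actual type):

```haskell
-- Strict fields, as in Anton's message.
data Regex = Eps | Sym Char | Seq !Regex !Regex

-- This definition typechecks: strictness annotations are not a
-- compile-time finiteness check.
loopy :: Regex
loopy = Seq loopy Eps

-- The error is deferred to runtime: forcing loopy to WHNF forces its
-- strict first field, which is loopy again, so GHC reports <<loop>>
-- (or the program hangs). Finite values behave normally:
depth :: Regex -> Int
depth Eps       = 0
depth (Sym _)   = 0
depth (Seq a b) = 1 + max (depth a) (depth b)
```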

* Anton Tayanovskyy
So you want to encode priorities efficiently as far as I understand from [1]? Could bit-packing combined with prefix elimination do the trick? Choice boils down to binary choice. Attach a number N to every execution thread that sits in a given NFA state. When the thread moves through a binary choice state, it splits into two threads with N_{left} = 2 * N + 1 and N_{right} = 2 * N. When two threads arrive at the same NFA state, the machine picks one with greater N and discards the other, thus giving priority to early over late and left over right. This makes these numbers grow quickly, but at any point in the simulation one can identify the common binary prefix of all the thread numbers and remove it, because it will not be relevant for comparison. If this is too costly, amortize and do it every K steps instead of on every step.
Anton
[1] https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
Just common prefix elimination is not sufficient to put a hard bound on the memory usage. For example, something like /(a*)|((aa)*)/ would still require an amount of space linear in the string length (if the string consists of many a's). More clever techniques that I tried ("monotonically map lists to smaller lists") are hard to make correct.

I like the approach of Russ Cox [2]. One of the great ideas there (which I think he didn't emphasize enough) is to have a queue which allows O(1) testing whether an element is already there [3]. This solves the problem with priorities -- the states are guaranteed to be enqueued in the order of their priorities, and there are no repetitions.

To be able to process the states in the proper order I'll have to give up the Glushkov automaton and probably use something similar to your construction [4].

[2] http://swtch.com/~rsc/regexp/regexp2.html
[3] http://code.google.com/p/re2/source/browse/util/sparse_array.h
[4] http://t0yv0.blogspot.com/2011/07/combinatory-regular-expressions-in.html

-- Roman I. Cheplyaka :: http://ro-che.info/
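The structure behind RE2's sparse_array.h (often attributed to Briggs and Torczon) can be sketched in Haskell's ST monad. This is an illustrative reimplementation, not RE2's code: insert and membership are O(1), and `elems` returns elements in insertion order, which is exactly what preserves thread priorities while deduplicating states. Callers must keep elements below the fixed capacity:

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray)
import Data.STRef (STRef, newSTRef, readSTRef, writeSTRef)

data SparseSet s = SparseSet
  { dense  :: STUArray s Int Int  -- dense[i] = i-th inserted element
  , sparse :: STUArray s Int Int  -- sparse[x] = claimed position of x in dense
  , sizeR  :: STRef s Int
  }

newSet :: Int -> ST s (SparseSet s)
newSet cap = SparseSet <$> newArray (0, cap - 1) 0
                       <*> newArray (0, cap - 1) 0
                       <*> newSTRef 0

-- x is present iff its claimed dense slot is in use and points back to x.
member :: SparseSet s -> Int -> ST s Bool
member s x = do
  n <- readSTRef (sizeR s)
  i <- readArray (sparse s) x
  if i < n then (== x) <$> readArray (dense s) i else pure False

-- Append to the dense array; duplicates are dropped, so the dense
-- array records first-insertion (i.e. priority) order.
insert :: SparseSet s -> Int -> ST s ()
insert s x = do
  present <- member s x
  if present then pure () else do
    n <- readSTRef (sizeR s)
    writeArray (dense s) n x
    writeArray (sparse s) x n
    writeSTRef (sizeR s) (n + 1)

elems :: SparseSet s -> ST s [Int]
elems s = do
  n <- readSTRef (sizeR s)
  mapM (readArray (dense s)) [0 .. n - 1]
```

(In C the trick also skips initializing the sparse array; Haskell's `newArray` zero-fills regardless, but the membership check is the same.)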

I like the approach of Russ Cox[2]. One of the great ideas there (which I think he didn't emphasize enough) is to have a queue which allows O(1) testing whether an element is already there [3]. This solves the problem with priorities -- the states are guaranteed to be enqueued in the order of their priorities, and there are no repetitions.
Hm, this sounds great! So then the number of states is just the size of the NFA, and the memory does not depend on the input string length? Have you tried this approach yet? I wouldn't vouch for my code in that gist; I kind of remember trying to eliminate the duplicates while preserving order, but I'm not sure if I did it correctly there.
To be able to process the states in the proper order I'll have to give up Glushkov automaton and probably use something similar to your construction [4].
[2] http://swtch.com/~rsc/regexp/regexp2.html
[3] http://code.google.com/p/re2/source/browse/util/sparse_array.h
[4] http://t0yv0.blogspot.com/2011/07/combinatory-regular-expressions-in.html
-- Roman I. Cheplyaka :: http://ro-che.info/
-- Kind Regards, Anton Tayanovskyy WebSharper™ - type-safe JavaScript in F# http://intellifactory.com
participants (7)
- Anton Tayanovskyy
- Brandon Allbery
- Chris Kuklewicz
- Chris Smith
- Edward Kmett
- Eugene Kirpichov
- Roman Cheplyaka