
If you are worried about Denial Of Service attacks then you are worried about (1) memory explosion, and (2) long runtime. Long runtime can and should be solved by running a timeout connected to a thread killer, not at the level of the regular expression engine. The memory explosion is more interesting, and here the engine's tradeoffs have something to say: Since I know simple POSIX regular expression matching better, here is an overview of this without subexpression capture: (1) The full DFA may be exponentially large and thus cannot be built ahead of time. (2) Building the DFA in a lazy fashion will cause memory to grow linearly, which will be a problem if left to run too long. (2.1) Pure lazy DFA building won't have a way to determine the currently computed side of the DFA, and is not going to help you. (2.2) Cheating the "pure" with unsafePerformIO might be able to keep a count of the number of computed DFA nodes. Good luck with thread safety. (2.3) Making it a stateful expanding DFA would allow you to control and monitor the growing number of DFA states. (2.4) Adjust the stateful DFA to keep a limited cache of computed DFA nodes. (3) Building an NFA from your regular expression can be cheap, but you would have be more clever than expanding "a(b{2,99}c){100}d" to have 100 copies of 99 copies of b. (4) Using an NFA with multiple states (breadth-first) is like the limit of a DFA with only a single computed node at a time. The number of states active in the pure NFA can be easily counted and bounded to keep the system from using too much memory. Without the lazy or stateful caching of DFA states this may be slightly slower, depending on whether the caches states would be re-used heavily or not. If you want POSIX subexpression capture then your memory required go up by a factor that is, worst case, proportional to the regular expression size. This worst case factor for the "a(b{2,99}c){100}d" does have to track 100 copies of 99 copies of the tracking information. ( The size of the NFA can still be clever when adding capture, but the tracked information you now need cannot be so clever ). The Russ Cox engine in re2 uses NFA/DFA techniques to achieve Perl semantics, where it can check the possibilities in parallel. The same problems with exploding DFA size and tracked information may still apply. On 14/09/2011 07:32, Roman Cheplyaka wrote:
* Eugene Kirpichov
[2011-09-14 08:38:10+0400] Hi, I don't see how fallback to NFA simulation is really a failure wrt DoS attacks. It's not exponential in time or memory, just linear memory (in size of regex) instead of constant, and slower than DFA.
Hi Eugene, thanks for pointing that out.
Indeed, I now see that he uses breadth-first rather than depth-first search. Then he has the same problem as I do. I shall study his code more closely.