Fastest regex package?

Hello. What is the fastest regex package of the multitude of packages present on Hackage (namely, PCRE, PCRE-light, DFA, TDFA and many more)?

My benchmark (parsing a huge logfile with a regex like "GET /foo.xml.*fooid=([0-9]++).*barid=([0-9]++)") shows that plain PCRE is the fastest one (I tried PCRE, PCRE-light and TDFA; DFA can't do capturing groups at all, and TDFA was abysmally slow, about 20x slower than PCRE, and doesn't support ++), but maybe I have missed some blazing-fast package?

Even PCRE seems somewhat too slow to me: matching this regex eats 50% of the program's time. Considering that the only things the program does are ungzipping the log and searching for the regex, this seems like far too much; the regex is not that complex to my mind. I tried to replace the regex with Data.ByteString.Search.KnuthMorrisPratt and BoyerMoore from stringsearch, but they actually ran about twice as slow.

I also noticed that PCRE does about 2x fewer allocations than PCRE-light, but is not much faster. Strangely, PCRE has a non-pure interface, so I had to use it with unsafePerformIO.

It's a pity I can't put these findings on the haskellwiki at http://www.haskell.org/haskellwiki/Regular_expressions, since I don't have an account and new accounts can't be created right now. Could anyone put them there?

All in all, my question remains: what is the fastest way to do this kind of parsing on a lazy bytestring?
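For reference, the pcre-light side of such a benchmark looks roughly like the sketch below (a minimal sketch, assuming the pcre-light package; the function name extractIds and the sample URL shape are illustrative, not from the original benchmark):

```haskell
-- Minimal sketch of matching the benchmark regex with pcre-light.
-- Assumes the pcre-light package is installed; extractIds is a
-- hypothetical helper name, not part of any library.
import qualified Data.ByteString.Char8 as B
import Text.Regex.PCRE.Light (Regex, compile, match)

pat :: Regex
pat = compile (B.pack "GET /foo.xml.*fooid=([0-9]++).*barid=([0-9]++)") []

-- On a match, the head of the result list is the whole match and the
-- tail holds the capture groups, so we pattern-match on three elements.
extractIds :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
extractIds line = case match pat line [] of
  Just [_, fooid, barid] -> Just (fooid, barid)
  _                      -> Nothing
```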

I was wrong about BoyerMoore: it runs about 2x faster than PCRE for
this example, so it seems to do the trick (although I suspect that
one could go even faster, so the question remains).
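A regex-free extraction along these lines can be sketched with plain substring search. The sketch below uses only the bytestring package's breakSubstring to keep it self-contained; stringsearch's Data.ByteString.Search.BoyerMoore provides a faster search of the same shape. The helper names are illustrative:

```haskell
-- Regex-free extraction of fooid/barid via substring search.
-- breakSubstring is from the bytestring package; stringsearch's
-- BoyerMoore search is a faster drop-in for the same idea.
import Data.Char (isDigit)
import qualified Data.ByteString.Char8 as B

-- Find "key" in the line and return the run of digits following it.
digitsAfter :: B.ByteString -> B.ByteString -> Maybe B.ByteString
digitsAfter key line =
  let (_, rest) = B.breakSubstring key line
  in if B.null rest
       then Nothing
       else Just (B.takeWhile isDigit (B.drop (B.length key) rest))

extractIds :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
extractIds line = do
  fooid <- digitsAfter (B.pack "fooid=") line
  barid <- digitsAfter (B.pack "barid=") line
  if B.null fooid || B.null barid then Nothing else Just (fooid, barid)
```

Unlike the regex, this makes the "at least one digit" requirement explicit with the final null check, since takeWhile happily returns an empty string.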

Eugene Kirpichov wrote:
All in all, my question remains: what is the fastest way to do this kind of parsing on a lazy bytestring?
Your example regular expression works the same in both Posix and Perl-ish semantics. Do you know the difference? Posix libraries look for the longest match of all possible matches. Perl-ish matching is left-biased: it looks at the left branch first and only looks at the right branch when the left fails and it has to backtrack.

So the "++" operator is a hack to try to control the backtracking of Perl regular expressions. Such a thing has no meaning in Posix, where the implementation details are totally different.

You might try this variant of the example pattern: /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+)

The [^0-9] can be used when you know that there is at least one junk character before the barid, which I suspect will always occur in a URL.

I expect regex-posix to be slower than regex-pcre. I have not used the new pcre-light. I wrote regex-tdfa: it is pure Haskell, not a C library wrapper. There are patterns where regex-pcre will backtrack and take exponentially more time than regex-tdfa's automaton (which is not yet ideal and may get faster).

So what is the lazy bytestring with its multiple buffers doing for you when using PCRE, PCRE-light, or regex-posix? Absolutely nothing. To run against these C libraries the target text is converted to a single buffer, i.e. a CStringLen in Haskell, so it is morally converted into a strict bytestring. This may involve copying the logfile into a NEW strict bytestring EVERY TIME you run a match. Please, please, please convert to a strict bytestring first and then run regex-pcre or pcre-light (or any of the others). regex-tdfa does not convert it into a strict bytestring, but is otherwise much slower than pcre for your simple pattern.

As for regex-pcre's interface: you should use the API in regex-base to get a pure interface. The RegexLike functions are the pure interface for this, and the RegexContext class offers a slew of instances with useful variants.
But if you have been going through the "low level IO API" in regex-pcre then you probably do not need or want the RegexContext transformations.

And BoyerMoore (which I think I helped optimize): this may be faster because it does not copy your whole lazy ByteString into a strict ByteString for each search. But you may wish to test it with a strict ByteString as input anyway.

-- Chris
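The pure RegexLike interface Chris mentions looks roughly like this (a sketch assuming the regex-pcre package; the same =~ operator works against any regex-base backend such as regex-tdfa or regex-posix, and the pattern here is Chris's [^0-9] variant):

```haskell
-- Pure matching through regex-base's API, re-exported by regex-pcre.
-- No unsafePerformIO is needed on the caller's side: (=~) is pure.
-- extractIds is an illustrative helper name.
import qualified Data.ByteString.Char8 as B
import Text.Regex.PCRE ((=~))

-- Asking for [[ByteString]] returns, per match, the whole match
-- followed by the capture groups (a RegexContext instance choice).
extractIds :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
extractIds line =
  case line =~ B.pack "fooid=([0-9]+)[^0-9].*barid=([0-9]+)" :: [[B.ByteString]] of
    [[_, fooid, barid]] -> Just (fooid, barid)
    _                   -> Nothing
```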

2009/2/5 ChrisK
Eugene Kirpichov wrote:
All in all, my question remains: what is the fastest way to do this kind of parsing on a lazy bytestring?
Your example regular expression works the same in both Posix and Perl-ish semantics. Do you know the difference? Posix libraries look for the longest match of all possible matches. Perl-ish matching is left-biased: it looks at the left branch first and only looks at the right branch when the left fails and it has to backtrack.
Thanks, I didn't know that about posix.
So the "++" operator is a hack to try to control the backtracking of Perl regular expressions. Such a thing has no meaning in Posix, where the implementation details are totally different.
That's precisely what I put it in for :)
You might try this variant of the example pattern: /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+)
The [^0-9] can be used when you know that there is at least one junk character before the barid, which I suspect will always occur in a URL.
I expect regex-posix to be slower than regex-pcre. I have not used the new pcre-light. I wrote regex-tdfa: it is pure Haskell, not a C library wrapper. There are patterns where regex-pcre will backtrack and take exponentially more time than regex-tdfa's automaton (which is not yet ideal and may get faster).
You're right :)
So what is the lazy bytestring with its multiple buffers doing for you when using PCRE, PCRE-light, or regex-posix? Absolutely nothing. To run against these C libraries the target text is converted to a single buffer, i.e. a CStringLen in Haskell. Thus it is morally converted into a strict bytestring. This may involve copying the logfile into a NEW strict bytestring EVERY TIME you run a match. Please Please Please convert to a strict bytestring and then run regex-pcre or pcre-light (or any of the others).
I'm running the regex on lines of the logfile, not on the whole logfile itself, and I'm running it only once per line, so that doesn't matter much. Actually, that's what I did in the initial version of the program (strictify lines before matching), but then I experimentally removed the extra conversion function and performance didn't change at all (someone was converting that string exactly once anyway, be it me or the package), so I left it that way.
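The strictify-per-line step being discussed can be sketched as follows, using only the bytestring package (B.concat . BL.toChunks is the lazy-to-strict conversion; the strictLines name is illustrative):

```haskell
-- Sketch: split a lazy ByteString logfile into lines and strictify
-- each line exactly once, so a C-backed regex library does not have
-- to re-assemble lazy chunks into a single buffer on every match.
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL

strictLines :: BL.ByteString -> [B.ByteString]
strictLines = map (B.concat . BL.toChunks) . BL.lines
```

Each strict line can then be fed to pcre-light or regex-pcre directly; the conversion cost is paid once per line either way, which matches the observation that performance did not change.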
regex-tdfa does not convert it into a strict bytestring, but is otherwise much slower than pcre for your simple pattern.
As for regex-pcre's interface: you should use the API in regex-base to get a pure interface. The RegexLike functions are the pure interface for this, and the RegexContext class offers a slew of instances with useful variants.
Thanks, I didn't know that.
But if you have been getting to the "low level IO API" in regex-pcre then you probably do not need or want the RegexContext transformations.
Is that because of their performance?
And BoyerMoore (which I think I helped optimize): this may be faster because it does not copy your whole Lazy bytestring into a Strict ByteString for each search. But you may wish to test it with a Strict ByteString as input anyway.
-- Chris

On 2009 Feb 5, at 10:26, Eugene Kirpichov wrote:
My benchmark (parsing a huge logfile with a regex like "GET /foo.xml.*fooid=([0-9]++).*barid=([0-9]++)") shows that plain PCRE is the fastest one (I tried PCRE, PCRE-light and TDFA; DFA can't do capturing groups at all, TDFA was abysmally slow (about 20x slower than PCRE), and it doesn't support ++), but maybe have I missed any blazing-fast package?
I think dons (copied) will want to hear about this; pcre-light is supposed to be a fast lightweight wrapper for the PCRE library, and if it's slower than PCRE then something is likely to be wrong somewhere.
--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

allbery:
I think dons (copied) will want to hear about this; pcre-light is supposed to be a fast lightweight wrapper for the PCRE library, and if it's slower than PCRE then something is likely to be wrong somewhere.
Shouldn't be slower (assuming you're using bytestrings). -- Don
participants (4)
- Brandon S. Allbery KF8NH
- ChrisK
- Don Stewart
- Eugene Kirpichov