[Haskell-cafe] Re: Fastest regex package?

7 Feb 2009


      2009/2/5 ChrisK :
...
Eugene Kirpichov wrote:
...
...
All in all, my question remains: what is the fastest way to do this
kind of parsing on a lazy bytestring?
Your example regular expression works the same in both Posix and Perl-ish
semantics.   Do you know the difference?  Posix libraries look for the
longest match of all possible matches.  Perl-ish is left-bias and looks at
the left branch first and only looks at the right branch when the left fails
and it has to backtrack.
Thanks, I didn't know that about posix.
...
So the "++" operator is a hack to try and control the backtracking of Perl
regular expressions.  Such a things has no meaning in Posix where the
implementation details are totally different.
That's precisely what I put it in for :)
...
You might try this variant of the example pattern:
 /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+)
The [^0-9] can be used when you know that there is at least one junk
character before the barid, which I suspect will always occur in a URL.
I expect regex-posix to be slower than regex-pcre. I have not used the new
pcre-light.  I wrote regex-tdfa — it is pure haskell and not a C library
wrapper.  There are patterns where regex-pcre will backtrack and take
exponentially more time than regex-tdfa's automaton (which is not yet ideal
and may get faster).
You're right :)
...
So what is the lazy bytestring with its multiple buffers doing for you when
using PCRE, PCRE-light, or regex-posix? Absolutely nothing.  To run against
these C libraries the target text is converted to a single buffer, i.e. a
CStringLen in Haskell.  Thus it is morally converted into a strict
bytestring. This may involve copying the logfile into a NEW strict
bytestring EVERY TIME you run a match.  Please Please Please convert to a
strict bytestring and then run regex-pcre or pcre-light (or any of the
others).
I'm running the regex on lines of the logfile, not on the whole
logfile itself; and I'm running it only once per line, so that doesn't
matter much.
Actually, that's what I did in the initial version of the program
(strictify lines before matching), but then I experimentally got rid
of the extra conversion function and performance didn't change at all
(anyways, someone was converting that string exactly once, be it me or
the package), so I left it that way.
...
regex-tdfa does not convert it into a strict bytestring, but is otherwise
much slower than pcre for your simple pattern.
As for regex-pcre's interface....you should use the API in regex-base to get
a pure interface.   The RegexLike functions are the pure interface for this,
and the RegexContext class offers a slew of instances with useful variants.
Thanks, I didn't know that.
...
But if you have been getting to the "low level IO API" in regex-pcre then
you probably do not need or want the RegexContext transformations.
Is that because of their performance?
...
And BoyerMoore (which I think I helped optimize): this may be faster because
it does not copy your whole Lazy bytestring into a Strict ByteString for
each search.  But you may wish to test it with a Strict ByteString as input
anyway.
--
Chris