Fastest regex package?

Hello. What is the fastest regex package of the multitude of packages present on Hackage (namely, PCRE, PCRE-light, DFA, TDFA and many more)?

My benchmark (parsing a huge logfile with a regex like "GET /foo.xml.*fooid=([0-9]++).*barid=([0-9]++)") shows that plain PCRE is the fastest one (I tried PCRE, PCRE-light and TDFA; DFA can't do capturing groups at all, and TDFA was abysmally slow, about 20x slower than PCRE, and doesn't support ++), but maybe I have missed some blazing-fast package?

Even PCRE seems somewhat too slow to me: matching this regex eats 50% of the program's time. Considering that the only things the program does are ungzipping the log and searching for the regex, this seems like far too much; the regex is not that complex to my mind. I tried to replace the regex with Data.ByteString.Search.KnuthMorrisPratt and BoyerMoore from stringsearch, but they actually ran about twice as slow.

I also noticed that PCRE does about 2x fewer allocations than PCRE-light, but is not much faster. Strangely, PCRE has a non-pure interface, so I had to use it with unsafePerformIO.

It's a pity I can't put these findings on the haskellwiki at http://www.haskell.org/haskellwiki/Regular_expressions, since I don't have an account and new accounts can't be created right now. Could anyone put them there?

All in all, my question remains: what is the fastest way to do this kind of parsing on a lazy bytestring?
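For reference, the pcre-light side of such a benchmark looks roughly like the sketch below (a minimal sketch, assuming the pcre-light package; the function name extractIds and the sample URL shape are illustrative, not from the original benchmark):

```haskell
-- Minimal sketch of matching the benchmark regex with pcre-light.
-- Assumes the pcre-light package is installed; extractIds is a
-- hypothetical helper name, not part of any library.
import qualified Data.ByteString.Char8 as B
import Text.Regex.PCRE.Light (Regex, compile, match)

pat :: Regex
pat = compile (B.pack "GET /foo.xml.*fooid=([0-9]++).*barid=([0-9]++)") []

-- On a match, the head of the result list is the whole match and the
-- tail holds the capture groups, so we pattern-match on three elements.
extractIds :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
extractIds line = case match pat line [] of
  Just [_, fooid, barid] -> Just (fooid, barid)
  _                      -> Nothing
```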

I was wrong about BoyerMoore: it runs about 2x faster than PCRE for
this example, so it seems to do the trick (although I suspect that
one could go even faster, so the question remains).
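A regex-free extraction along these lines can be sketched with plain substring search. The sketch below uses only the bytestring package's breakSubstring to keep it self-contained; stringsearch's Data.ByteString.Search.BoyerMoore provides a faster search of the same shape. The helper names are illustrative:

```haskell
-- Regex-free extraction of fooid/barid via substring search.
-- breakSubstring is from the bytestring package; stringsearch's
-- BoyerMoore search is a faster drop-in for the same idea.
import Data.Char (isDigit)
import qualified Data.ByteString.Char8 as B

-- Find "key" in the line and return the run of digits following it.
digitsAfter :: B.ByteString -> B.ByteString -> Maybe B.ByteString
digitsAfter key line =
  let (_, rest) = B.breakSubstring key line
  in if B.null rest
       then Nothing
       else Just (B.takeWhile isDigit (B.drop (B.length key) rest))

extractIds :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
extractIds line = do
  fooid <- digitsAfter (B.pack "fooid=") line
  barid <- digitsAfter (B.pack "barid=") line
  if B.null fooid || B.null barid then Nothing else Just (fooid, barid)
```

Unlike the regex, this makes the "at least one digit" requirement explicit with the final null check, since takeWhile happily returns an empty string.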

Eugene Kirpichov wrote:
All in all, my question remains: what is the fastest way to do this kind of parsing on a lazy bytestring?
Your example regular expression works the same in both Posix and Perl-ish semantics. Do you know the difference? Posix libraries look for the longest match of all possible matches. Perl-ish matching is left-biased: it looks at the left branch first and only looks at the right branch when the left fails and it has to backtrack.

So the "++" operator is a hack to try to control the backtracking of Perl regular expressions. Such a thing has no meaning in Posix, where the implementation details are totally different.

You might try this variant of the example pattern: /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+)

The [^0-9] can be used when you know that there is at least one junk character before the barid, which I suspect will always occur in a URL.

I expect regex-posix to be slower than regex-pcre. I have not used the new pcre-light. I wrote regex-tdfa: it is pure Haskell, not a C library wrapper. There are patterns where regex-pcre will backtrack and take exponentially more time than regex-tdfa's automaton (which is not yet ideal and may get faster).

So what is the lazy bytestring with its multiple buffers doing for you when using PCRE, PCRE-light, or regex-posix? Absolutely nothing. To run against these C libraries the target text is converted to a single buffer, i.e. a CStringLen in Haskell, so it is morally converted into a strict bytestring. This may involve copying the logfile into a NEW strict bytestring EVERY TIME you run a match. Please, please, please convert to a strict bytestring first and then run regex-pcre or pcre-light (or any of the others). regex-tdfa does not convert it into a strict bytestring, but is otherwise much slower than pcre for your simple pattern.

As for regex-pcre's interface: you should use the API in regex-base to get a pure interface. The RegexLike functions are the pure interface for this, and the RegexContext class offers a slew of instances with useful variants.
But if you have been going through the "low level IO API" in regex-pcre then you probably do not need or want the RegexContext transformations.

And BoyerMoore (which I think I helped optimize): this may be faster because it does not copy your whole lazy ByteString into a strict ByteString for each search. But you may wish to test it with a strict ByteString as input anyway.

-- Chris
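The pure RegexLike interface Chris mentions looks roughly like this (a sketch assuming the regex-pcre package; the same =~ operator works against any regex-base backend such as regex-tdfa or regex-posix, and the pattern here is Chris's [^0-9] variant):

```haskell
-- Pure matching through regex-base's API, re-exported by regex-pcre.
-- No unsafePerformIO is needed on the caller's side: (=~) is pure.
-- extractIds is an illustrative helper name.
import qualified Data.ByteString.Char8 as B
import Text.Regex.PCRE ((=~))

-- Asking for [[ByteString]] returns, per match, the whole match
-- followed by the capture groups (a RegexContext instance choice).
extractIds :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
extractIds line =
  case line =~ B.pack "fooid=([0-9]+)[^0-9].*barid=([0-9]+)" :: [[B.ByteString]] of
    [[_, fooid, barid]] -> Just (fooid, barid)
    _                   -> Nothing
```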

2009/2/5 ChrisK
Eugene Kirpichov wrote:
All in all, my question remains: what is the fastest way to do this kind of parsing on a lazy bytestring?
Your example regular expression works the same in both Posix and Perl-ish semantics. Do you know the difference? Posix libraries look for the longest match of all possible matches. Perl-ish matching is left-biased: it looks at the left branch first and only looks at the right branch when the left fails and it has to backtrack.
Thanks, I didn't know that about posix.
So the "++" operator is a hack to try to control the backtracking of Perl regular expressions. Such a thing has no meaning in Posix, where the implementation details are totally different.
That's precisely what I put it in for :)
You might try this variant of the example pattern: /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+)
The [^0-9] can be used when you know that there is at least one junk character before the barid, which I suspect will always occur in a URL.
I expect regex-posix to be slower than regex-pcre. I have not used the new pcre-light. I wrote regex-tdfa: it is pure Haskell, not a C library wrapper. There are patterns where regex-pcre will backtrack and take exponentially more time than regex-tdfa's automaton (which is not yet ideal and may get faster).
You're right :)
So what is the lazy bytestring with its multiple buffers doing for you when using PCRE, PCRE-light, or regex-posix? Absolutely nothing. To run against these C libraries the target text is converted to a single buffer, i.e. a CStringLen in Haskell. Thus it is morally converted into a strict bytestring. This may involve copying the logfile into a NEW strict bytestring EVERY TIME you run a match. Please Please Please convert to a strict bytestring and then run regex-pcre or pcre-light (or any of the others).
I'm running the regex on lines of the logfile, not on the whole logfile itself, and I'm running it only once per line, so that doesn't matter much. Actually, that's what I did in the initial version of the program (strictify lines before matching), but then I experimentally removed the extra conversion function and performance didn't change at all (someone was converting that string exactly once anyway, be it me or the package), so I left it that way.
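The strictify-per-line step being discussed can be sketched as follows, using only the bytestring package (B.concat . BL.toChunks is the lazy-to-strict conversion; the strictLines name is illustrative):

```haskell
-- Sketch: split a lazy ByteString logfile into lines and strictify
-- each line exactly once, so a C-backed regex library does not have
-- to re-assemble lazy chunks into a single buffer on every match.
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL

strictLines :: BL.ByteString -> [B.ByteString]
strictLines = map (B.concat . BL.toChunks) . BL.lines
```

Each strict line can then be fed to pcre-light or regex-pcre directly; the conversion cost is paid once per line either way, which matches the observation that performance did not change.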
regex-tdfa does not convert it into a strict bytestring, but is otherwise much slower than pcre for your simple pattern.
As for regex-pcre's interface: you should use the API in regex-base to get a pure interface. The RegexLike functions are the pure interface for this, and the RegexContext class offers a slew of instances with useful variants.
Thanks, I didn't know that.
But if you have been getting to the "low level IO API" in regex-pcre then you probably do not need or want the RegexContext transformations.
Is that because of their performance?
And BoyerMoore (which I think I helped optimize): this may be faster because it does not copy your whole Lazy bytestring into a Strict ByteString for each search. But you may wish to test it with a Strict ByteString as input anyway.
-- Chris

On 2009 Feb 5, at 10:26, Eugene Kirpichov wrote:
My benchmark (parsing a huge logfile with a regex like "GET /foo.xml.*fooid=([0-9]++).*barid=([0-9]++)") shows that plain PCRE is the fastest one (I tried PCRE, PCRE-light and TDFA; DFA can't do capturing groups at all, TDFA was abysmally slow (about 20x slower than PCRE), and it doesn't support ++), but maybe have I missed any blazing-fast package?
I think dons (copied) will want to hear about this; pcre-light is supposed to be a fast lightweight wrapper for the PCRE library, and if it's slower than PCRE then something is likely to be wrong somewhere.
--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

allbery:
I think dons (copied) will want to hear about this; pcre-light is supposed to be a fast lightweight wrapper for the PCRE library, and if it's slower than PCRE then something is likely to be wrong somewhere.
Shouldn't be slower (assuming you're using bytestrings). -- Don
participants (4)
- Brandon S. Allbery KF8NH
- ChrisK
- Don Stewart
- Eugene Kirpichov