
2009/2/5 ChrisK
Eugene Kirpichov wrote:
All in all, my question remains: what is the fastest way to do this kind of parsing on a lazy bytestring?
Your example regular expression works the same in both Posix and Perl-ish semantics. Do you know the difference? Posix libraries look for the longest match of all possible matches. Perl-ish is left-bias and looks at the left branch first and only looks at the right branch when the left fails and it has to backtrack.
Thanks, I didn't know that about posix.
So the "++" operator is a hack to try and control the backtracking of Perl regular expressions. Such a things has no meaning in Posix where the implementation details are totally different.
That's precisely what I put it in for :)
You might try this variant of the example pattern: /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+)
The [^0-9] can be used when you know that there is at least one junk character before the barid, which I suspect will always occur in a URL.
I expect regex-posix to be slower than regex-pcre. I have not used the new pcre-light. I wrote regex-tdfa — it is pure haskell and not a C library wrapper. There are patterns where regex-pcre will backtrack and take exponentially more time than regex-tdfa's automaton (which is not yet ideal and may get faster).
You're right :)
So what is the lazy bytestring with its multiple buffers doing for you when using PCRE, PCRE-light, or regex-posix? Absolutely nothing. To run against these C libraries the target text is converted to a single buffer, i.e. a CStringLen in Haskell. Thus it is morally converted into a strict bytestring. This may involve copying the logfile into a NEW strict bytestring EVERY TIME you run a match. Please Please Please convert to a strict bytestring and then run regex-pcre or pcre-light (or any of the others).
I'm running the regex on lines of the logfile, not on the whole logfile itself; and I'm running it only once per line, so that doesn't matter much. Actually, that's what I did in the initial version of the program (strictify lines before matching), but then I experimentally got rid of the extra conversion function and performance didn't change at all (anyways, someone was converting that string exactly once, be it me or the package), so I left it that way.
regex-tdfa does not convert it into a strict bytestring, but is otherwise much slower than pcre for your simple pattern.
As for regex-pcre's interface....you should use the API in regex-base to get a pure interface. The RegexLike functions are the pure interface for this, and the RegexContext class offers a slew of instances with useful variants.
Thanks, I didn't know that.
But if you have been getting to the "low level IO API" in regex-pcre then you probably do not need or want the RegexContext transformations.
Is that because of their performance?
And BoyerMoore (which I think I helped optimize): this may be faster because it does not copy your whole Lazy bytestring into a Strict ByteString for each search. But you may wish to test it with a Strict ByteString as input anyway.
-- Chris