
Eugene Kirpichov wrote:
All in all, my question remains: what is the fastest way to do this kind of parsing on a lazy bytestring?
Your example regular expression works the same in both Posix and Perl-ish semantics. Do you know the difference? Posix libraries look for the longest match of all possible matches. Perl-ish is left-bias and looks at the left branch first and only looks at the right branch when the left fails and it has to backtrack. So the "++" operator is a hack to try and control the backtracking of Perl regular expressions. Such a things has no meaning in Posix where the implementation details are totally different. You might try this variant of the example pattern: /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+) The [^0-9] can be used when you know that there is at least one junk character before the barid, which I suspect will always occur in a URL. I expect regex-posix to be slower than regex-pcre. I have not used the new pcre-light. I wrote regex-tdfa — it is pure haskell and not a C library wrapper. There are patterns where regex-pcre will backtrack and take exponentially more time than regex-tdfa's automaton (which is not yet ideal and may get faster). So what is the lazy bytestring with its multiple buffers doing for you when using PCRE, PCRE-light, or regex-posix? Absolutely nothing. To run against these C libraries the target text is converted to a single buffer, i.e. a CStringLen in Haskell. Thus it is morally converted into a strict bytestring. This may involve copying the logfile into a NEW strict bytestring EVERY TIME you run a match. Please Please Please convert to a strict bytestring and then run regex-pcre or pcre-light (or any of the others). regex-tdfa does not convert it into a strict bytestring, but is otherwise much slower than pcre for your simple pattern. As for regex-pcre's interface....you should use the API in regex-base to get a pure interface. The RegexLike functions are the pure interface for this, and the RegexContext class offers a slew of instances with useful variants. But if you have been getting to the "low level IO API" in regex-pcre then you probably do not need or want the RegexContext transformations. And BoyerMoore (which I think I helped optimize): this may be faster because it does not copy your whole Lazy bytestring into a Strict ByteString for each search. But you may wish to test it with a Strict ByteString as input anyway. -- Chris