
skaller wrote:
On Mon, 2006-08-07 at 20:38 +0100, Chris Kuklewicz wrote:
skaller wrote:
Wouldn't it be nice to use Ville Laurikari's TRE package instead of PCRE?
[It is also Posix compliant and drop in replacement for gnu regex .. as well as supporting nice extensions]
It is possible to add support for more backends. The more the merrier, no need to replace anything. I have never heard of TRE before.
TRE is actually based on mathematics.
I have added TRE as a new backend. (I have libtre-0.7.4 on OS X, it is LGPL). And I have a benchmark, which I will now copy and paste. On "b(..?c)?d": PCRE is fast, TRE is almost as fast, the DFA (without subexpression capture) is a close 3rd. The Parsec backend is about 4 times slower, and the Posix library scales poorly, being far slower on large data sets. The code is in darcs http://evenmere.org/~chrisk/trl/head/ under regex-devel/bench Benchmark: find all matches and substring capture for "b(..?c)?d" in a one million character string on disk. Print the count, the first, and the last match (with captures). The benchmark program was, for String:
module Main(main) where
import Text.Regex.XXX
filename = "datafile" regex = "b(..?c)?d" main = do input <- readFile filename let a :: [[String]] a = input =~ regex b :: Int b = length a print (b,head a,last a)
and for ByteString:
module Main(main) where
import Text.Regex.XXX import qualified Data.ByteString as B
default (Int)
filename = "datafile" regex = "b(..?c)?d"
main = do input <- B.readFile filename let a :: [[B.ByteString]] a = input =~ regex b :: Int b = length a print (b,head a,last a)
where XXX was replaces by PCRE, Parsec, DFA, PosixRE, TRE and compile with "ghc -O2" Data file is 10^6 characters from permutations of the set "abcdbcdcdd\n" Using the 10^6 character datafile and String or ByteString. (The DFS uses a different semantics so a dot matches a newline, so the matching is different.) The output and user+sys reported by the time command: BenchPCRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 1.294s BenchTRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 2.128s BenchDFA (107811,["bcdcd"],["bbccd"]) total is 2.313s BenchParsec (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 8.094s BenchPosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 91.435s BenchBSPCRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 0.932s BenchBSTRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 1.297s BenchBSDFA (107811,["bcdcd"],["bbccd"]) total is 1.437s BenchBSParsec (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 8.496s BenchBSPosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 89.780 For 10^5 characters on String: PCRE 0.077s DFA 0.131s TRE 0.206s PosixRE 0.445s Parsec 0.825s Old Posix 43.760s (Text.Regex using splitRegex) Old Text.Regex took 43.76 seconds on 10^5 characters to do a task comparable to the one the others did (it ran splitRegex). The new PosixRE wrapping took 0.445 seconds instead. Yes it is two orders of magnitude faster, and this is because my wrapping only marshals the String to CString once. Laziness cannot be worth 2 orders of magnitude of runtime. This is why we needed a new wrapping, which has grown into the new library. On a 10^7 length data set: PCRE (979261,["bcdcd","cdc"],["bd",""]) time 17.388s TRE (979261,["bcdcd","cdc"],["bd",""]) time 17.880s DFA (1063961,["bcdcd"],["bd"]) time 21.617s Parsec (979261,["bcdcd","cdc"],["bd",""]) time 87.330s BenchBSPCRE (979261,["bcdcd","cdc"],["bd",""]) time 8.322s BenchBSTRE (979261,["bcdcd","cdc"],["bd",""]) time 12.644s BenchBSDFA (1063961,["bcdcd"],["bd"]) time 14.115s BenchBSParsec (979261,["bcdcd","cdc"],["bd",""]) time 83.395s

On 09 August 2006 15:14, Chris Kuklewicz wrote:
For 10^5 characters on String: PCRE 0.077s DFA 0.131s TRE 0.206s PosixRE 0.445s Parsec 0.825s Old Posix 43.760s (Text.Regex using splitRegex)
Old Text.Regex took 43.76 seconds on 10^5 characters to do a task comparable to the one the others did (it ran splitRegex). The new PosixRE wrapping took 0.445 seconds instead. Yes it is two orders of magnitude faster, and this is because my wrapping only marshals the String to CString once. Laziness cannot be worth 2 orders of magnitude of runtime. This is why we needed a new wrapping, which has grown into the new library.
Right, I see the problem with Text.Regex.splitRegex, it repeatedly packs the String into a CString. But then why this result:
BenchPCRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 1.294s .. etc. ... BenchPosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 91.435s
Was this the old Posix, or your new one? If the new one, why is it so slow compared to the others? Cheers, Simon

Simon Marlow wrote:
On 09 August 2006 15:14, Chris Kuklewicz wrote:
For 10^5 characters on String: PCRE 0.077s DFA 0.131s TRE 0.206s PosixRE 0.445s Parsec 0.825s Old Posix 43.760s (Text.Regex using splitRegex)
Old Text.Regex took 43.76 seconds on 10^5 characters to do a task comparable to the one the others did (it ran splitRegex). The new PosixRE wrapping took 0.445 seconds instead. Yes it is two orders of magnitude faster, and this is because my wrapping only marshals the String to CString once. Laziness cannot be worth 2 orders of magnitude of runtime. This is why we needed a new wrapping, which has grown into the new library.
Right, I see the problem with Text.Regex.splitRegex, it repeatedly packs the String into a CString. But then why this result:
BenchPCRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 1.294s .. etc. ... BenchPosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) total is 91.435s
Was this the old Posix, or your new one? If the new one, why is it so slow compared to the others?
Cheers, Simon
Your question has prompted me to go back into my PosixRE wrapping code and compare it to the PCRE code. I have made some changes which ought to enhance the performance of the PosixRE code. Let us see the new bechmarks on 10^6 bytes: PosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) real 1m35.429s user 1m17.862s sys 0m1.455s total is 79.317s PCRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) real 0m2.570s user 0m1.702s sys 0m0.219s total is 1.921s BenchBSPosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) real 1m32.267s user 1m16.494s sys 0m1.374s total is 77.868s BenchBSPCRE (102363,["bcdcd","cdc"],["bbccd","bcc"]) real 0m1.245s user 0m0.809s sys 0m0.110s total is 0.919s So there was only a little improvement to the previous PosixRE speed. If you want to look at the code, it is in the three Wrap.hsc for regex-posix and regex-tre and regex-pcre for the wrapMatchAll functions. But it appears to be a library issue, not a Haskell issue. I will tend to the Haddock cleanup next. -- Chris

Chris Kuklewicz wrote:
Your question has prompted me to go back into my PosixRE wrapping code and compare it to the PCRE code. I have made some changes which ought to enhance the performance of the PosixRE code. Let us see the new bechmarks on 10^6 bytes:
PosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
real 1m35.429s user 1m17.862s sys 0m1.455s
total is 79.317s
PCRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
real 0m2.570s user 0m1.702s sys 0m0.219s
total is 1.921s
So I still don't understand why PCRE should be 40 times faster than PosixRE. Surely this can't be just due to differences in the underlying C library? Cheers, Simon

simonmarhaskell:
Chris Kuklewicz wrote:
Your question has prompted me to go back into my PosixRE wrapping code and compare it to the PCRE code. I have made some changes which ought to enhance the performance of the PosixRE code. Let us see the new bechmarks on 10^6 bytes:
PosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
real 1m35.429s user 1m17.862s sys 0m1.455s
total is 79.317s
PCRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
real 0m2.570s user 0m1.702s sys 0m0.219s
total is 1.921s
So I still don't understand why PCRE should be 40 times faster than PosixRE. Surely this can't be just due to differences in the underlying C library?
It could be. The C regex.h is pretty slow. http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all -- Don

Donald Bruce Stewart wrote:
simonmarhaskell:
Chris Kuklewicz wrote:
Your question has prompted me to go back into my PosixRE wrapping code and compare it to the PCRE code. I have made some changes which ought to enhance the performance of the PosixRE code. Let us see the new bechmarks on 10^6 bytes:
PosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
real 1m35.429s user 1m17.862s sys 0m1.455s
total is 79.317s
PCRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
real 0m2.570s user 0m1.702s sys 0m0.219s
total is 1.921s So I still don't understand why PCRE should be 40 times faster than PosixRE. Surely this can't be just due to differences in the underlying C library?
It could be. The C regex.h is pretty slow.
http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all
-- Don
And I notice c++ (g++) gets away with a 3rd party library from boost:
// This implementation of regexdna does not use the POSIX regex // included with the GNU libc. Instead it uses the Boost C++ libraries // // http://www.boost.org/libs/regex/doc/index.html // // (On Debian: apt-get install libboost-regex-dev before compiling, // and then "g++ -O3 -lboost_regex regexdna.cc -o regexdna // Gentoo seems to package boost as, well, 'boost')
Which is a strange precedent. -- Chris

Chris Kuklewicz
// (On Debian: apt-get install libboost-regex-dev before compiling, // and then "g++ -O3 -lboost_regex regexdna.cc -o regexdna // Gentoo seems to package boost as, well, 'boost')
Which is a strange precedent.
Hmm.. I have a version of KNucleotide that uses a compressed index from a C library, and which I think beat all other entries by a healthy margin. I dismissed this, since it is using non-standard libraries, but perhaps it is allowed after all? -k -- If I haven't seen further, it is by standing in the footprints of giants
participants (5)
-
Chris Kuklewicz
-
dons@cse.unsw.edu.au
-
Ketil Malde
-
Simon Marlow
-
Simon Marlow