
module Main where
import Text.Regex import System.IO import System.Environment import Control.Monad import Data.Maybe
main = do [re] <- getArgs let rx = mkRegex re let loop = do line <- getLine when (isJust (matchRegex rx line)) (putStrLn line) eof <- isEOF unless eof loop loop
It turned out that this is remarkably slow. The first problem was with inlining. If this is compiled with ghc-5.04 -O -ddump-simpl, I get:
case GHC.IOBase.unsafePerformIO @ (Data.Maybe.Maybe (GHC.Base.String, GHC.Base.String, GHC.Base.String, [GHC.Base.String])) (Text.Regex.Posix.regexec (Text.Regex.mkRegex re) a731) of wild4 {
This is indeed an optimiser bug, but it's the result of a design decision: GHC is a bit laid back about inlining things inside the state lambda in the IO monad, because it often enables important optimisations. However, we're experimenting with modifying this "optimisation" so that it will be less likely to kill performance in the way it did in your example. In the meantime, you can add rx as an argument to loop, that will be enough to fool GHC into not inlining rx.
Ie. the regex is compiled anew every time a string is matched. A bug?
Anyway, without optimization the code produced is reasonable, but still horrendously slow. Testing with a simple word as a pattern from a 7.3MB, 800kline file, the running time was 37.5 seconds. For comparison, a similar program in mzscheme (interpreted!) took 7.3 seconds while the system grep, of course, took 0.4 seconds.
I did some profiling by creating new top-level bindings for matchRegex and getLine (is there a better way?):
total time = 53.34 secs (2667 ticks @ 20 ms) total alloc = 1,172,482,496 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
match Main 69.7 56.8 getl Main 23.9 40.2 main Main 6.3 3.0
So it seems like all the time is spent just converting ByteArrays to char lists to C arrays. This makes me wonder how sensible it really is to represent strings by char lists. Yes, it's nice and uniform and lazy, but...
String processing in Haskell is very slow, due to the list-of-characters representation. A more complete PackedString library with better integration with other libraries (like Text.Regex) would help a lot for these kind of examples. Cheers, Simon
participants (1)
-
Simon Marlow