Strings are slow

19 Nov 2002

      Hello. Here's a grep program:

module Main where

import Text.Regex
import System.IO
import System.Environment
import Control.Monad
import Data.Maybe

main = do [re] <- getArgs
	  let rx = mkRegex re
	  let loop = do line <- getLine
			when (isJust (matchRegex rx line)) (putStrLn line)
			eof <- isEOF
			unless eof loop
	  loop

It turned out that this is remarkably slow. The first problem was with
inlining. If this is compiled with ghc-5.04 -O -ddump-simpl, I get:

	      case GHC.IOBase.unsafePerformIO
		     @ (Data.Maybe.Maybe
			    (GHC.Base.String,
			     GHC.Base.String,
			     GHC.Base.String,
			     [GHC.Base.String]))
		     (Text.Regex.Posix.regexec (Text.Regex.mkRegex re) a731)
	      of wild4 {

Ie. the regex is compiled anew every time a string is matched. A bug?

Anyway, without optimization the code produced is reasonable, but still
horrendously slow. Testing with a simple word as a pattern from a 7.3MB,
800kline file, the running time was 37.5 seconds. For comparison, a similar
program in mzscheme (interpreted!) took 7.3 seconds while the system
grep, of course, took 0.4 seconds.

I did some profiling by creating new top-level bindings for matchRegex
and getLine (is there a better way?):

        total time  =       53.34 secs   (2667 ticks @ 20 ms)
        total alloc = 1,172,482,496 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

match                          Main                  69.7   56.8
getl                           Main                  23.9   40.2
main                           Main                   6.3    3.0

So it seems like all the time is spent just converting ByteArrays to
char lists to C arrays. This makes me wonder how sensible it really is
to represent strings by char lists. Yes, it's nice and uniform and lazy,
but...

How can I get this faster, then? PackedStrings are not very useful
because they just don't support enough operations (getline, matchregex)
alone, and having to convert them to Strings sort of defeats their
purpose. _Any_ operation that provides only a String interface condemns
us to a gazillion allocations. And by default all char*-based foreign
interfaces are represented with Strings on the Haskell side.

Maybe a generic Textual class with at least String and PackedString (and
ByteArray?) as instances would help? Then the common string-based
operations could all have separate implementations for separate
representations. With heavy specialization, of course. This would be
especially useful if the FFI (especially withCString) supported it.

Or alternatively, maybe the foldr/build rewriting trick could be used to
eliminate some redundant conversions between representations?

Just throwing ideas in the air here.

Lauri Alanko
la@iki.fi