Speed of character reading in Haskell

I'm writing a tokeniser for a legacy programming language in Haskell.
I started with the obvious

    main = getContents >>= print . tokenise

where tokenise maps its way down a list of characters.  This was very
simple, very pleasant, and worked like a charm.  However, the language
has an INCLUDE directive, so I'm going to have to call readFile or
something in the middle of tokenising, which means the main tokeniser
loop can't be a pure String -> [Token] function any more.

It occurred to me to wonder whether it would be better to stick with
the list-of-characters style as much as possible (and use readFile) or
to switch over to character-at-a-time processing (with openFile,
hIsEOF, hGetChar, hClose).  So I decided on an experiment: count the
lines both ways.

Method 1A (pure list processing):

    main = getContents >>= print . doit 0

    doit n ('\n':cs) = doit (n+1) cs
    doit n ( _  :cs) = doit n cs
    doit n []        = n

Method 1B (doit is in the IO monad):

    main = getContents >>= doit 0

    doit n ('\n':cs) = doit (n+1) cs
    doit n ( _  :cs) = doit n cs
    doit n []        = print n

Method 2 (character at a time):

    import System.IO (isEOF)

    main = doit 0

    doit n = isEOF >>= \eof ->
               if eof
                 then print n
                 else getChar >>= \c ->
                      doit (if c == '\n' then n+1 else n)

I tried this using GHC 6.4 on a 2.8 GHz PC running Linux
2.6.20-1.2933.fc6.  I don't know whether gcc was involved at all, but
in case it was, the gcc version was 4.1.1.  All test programs were
compiled using

    ghc -o foobar -O foobar.hs

The input was a 3.8 MB file.

    wc:         0.08 seconds total
    Method 1A:  0.30 seconds total
    Method 1B:  0.47 seconds total
    Method 2:   ran out of stack
    Method 2':  2.21 seconds total

When I got the stack overflow, I suspected that this might be due to
lazy evaluation of the integer argument to doit, so I changed the last
line to

    doit $! (if c == '\n' then n+1 else n)

and that worked; that is Method 2' in the table above (full program at
the end of this message).  In *retrospect* it is really obvious why
this was necessary, but I must say that in *prospect* I wasn't
expecting it.

Other things being equal, I would have expected Method 2' to be
faster.  I'm now wondering if there is some other consequence of
laziness that I am just not seeing.  I'm also a little puzzled that
there is a measurable difference between Method 1A and Method 1B.

Things that I am not asking about:

- ghc version; that's what's installed on the machine, and if I want
  something different I may have to wait many days.

- style comments on the list-processing versions; yes, I know how to
  write

      getContents >>= print . length . filter (=='\n')

  but that really has no relevance to the full tokeniser, and in any
  case it is slower than Method 1A.
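
For reference, here is Method 2' in full.  It is just Method 2 with
the strict application on the last line, and the import spelled out so
it compiles as a standalone program:

    import System.IO (isEOF)

    -- Method 2': Method 2 with the accumulator forced at each step,
    -- so unevaluated (n+1) thunks don't pile up for every character.
    main = doit 0

    doit n = isEOF >>= \eof ->
               if eof
                 then print n
                 else getChar >>= \c ->
                      doit $! (if c == '\n' then n+1 else n)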