Speed of character reading in Haskell

I'm writing a tokeniser for a legacy programming language in Haskell.
I started with the obvious

    main = getContents >>= print . tokenise

where tokenise maps its way down a list of characters.  This was very
simple, very pleasant, and worked like a charm.  However, the language
has an INCLUDE directive, so I'm going to have to call readFile or
something in the middle of tokenising, which means the main tokeniser
loop can't be a pure String -> [Token] function any more.

It occurred to me to wonder whether it would be better to stick with
the list-of-characters style as much as possible (and use readFile) or
to switch over to character-at-a-time processing (with openFile,
hIsEOF, hGetChar, hClose).  So I decided on an experiment: count the
lines both ways.

Method 1A (pure list processing):

    main = getContents >>= print . doit 0

    doit n ('\n':cs) = doit (n+1) cs
    doit n ( _  :cs) = doit n cs
    doit n []        = n

Method 1B (doit is in the IO monad):

    main = getContents >>= doit 0

    doit n ('\n':cs) = doit (n+1) cs
    doit n ( _  :cs) = doit n cs
    doit n []        = print n

Method 2 (character at a time):

    import System.IO (isEOF)

    main = doit 0

    doit n = isEOF >>= \eof ->
               if eof
                 then print n
                 else getChar >>= \c ->
                      doit (if c == '\n' then n+1 else n)

I tried this using GHC 6.4 on a 2.8 GHz PC running Linux
2.6.20-1.2933.fc6.  I don't know whether gcc was involved at all, but
in case it was, the gcc version was 4.1.1.  All test programs were
compiled using

    ghc -o foobar -O foobar.hs

The input was a 3.8 MB file.

    wc:         0.08 seconds total
    Method 1A:  0.30 seconds total
    Method 1B:  0.47 seconds total
    Method 2:   ran out of stack
    Method 2':  2.21 seconds total

When I got the stack overflow, I suspected that this might be due to
lazy evaluation of the integer argument to doit, so I changed the last
line to

    doit $! (if c == '\n' then n+1 else n)

and that worked; that is Method 2' in the table above (full program at
the end of this message).  In *retrospect* it is really obvious why
this was necessary, but I must say that in *prospect* I wasn't
expecting it.

Other things being equal, I would have expected Method 2' to be
faster.  I'm now wondering if there is some other consequence of
laziness that I am just not seeing.  I'm also a little puzzled that
there is a measurable difference between Method 1A and Method 1B.

Things that I am not asking about:

- ghc version; that's what's installed on the machine, and if I want
  something different I may have to wait many days.

- style comments on the list-processing versions; yes, I know how to
  write

      getContents >>= print . length . filter (=='\n')

  but that really has no relevance to the full tokeniser, and in any
  case it is slower than Method 1A.
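
For reference, here is Method 2' in full.  It is just Method 2 with
the strict application on the last line, and the import spelled out so
it compiles as a standalone program:

    import System.IO (isEOF)

    -- Method 2': Method 2 with the accumulator forced at each step,
    -- so unevaluated (n+1) thunks don't pile up for every character.
    main = doit 0

    doit n = isEOF >>= \eof ->
               if eof
                 then print n
                 else getChar >>= \c ->
                      doit $! (if c == '\n' then n+1 else n)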