[Haskell-cafe] Data.ByteStream.Char8.words performance

31 Mar 2007

      I noticed something about ByteStream performance that I don't
understand.

I have a test text document:

    $ ls -sh test-text-file
    956K test-text-file

Running this program, using the Prelude's IO functions:
...
module Main where
main = do
   content <- readFile "test-text-file"
   let l = length . words $ content
   print l
I get:

    $ time ./a.out
    174372

    real    0m0.805s
    user    0m0.720s
    sys     0m0.008s

Running a version of the same thing using Data.ByteStream.Char8:
...
module Main where
import qualified Data.ByteString.Char8 as B
main = do
   content <- B.readFile "test-text-file"
   let l = length . B.words $ content
   print l
I see a time that is quite a bit slower:

    $ time ./a.out
    174372

    real    0m1.864s
    user    0m1.596s
    sys     0m0.012s

Changing it to incorporate similar code to the implementation of
B.words:
...
module Main where
import qualified Data.ByteString.Char8 as B
import Data.Char (isSpace)
main = do
   content <- B.readFile "test-text-file"
   let l = length $ filter (not . B.null) $ B.splitWith isSpace
   content
   print l
I see a similar time as with B.words:

    $ time ./a.out
    174372

    real    0m1.835s
    user    0m1.628s
    sys     0m0.012s

And then if I change this to use B.split ' ' instead of isSpace:
...
module Main where
import qualified Data.ByteString.Char8 as B
main = do
   content <- B.readFile "test-text-file"
   let l = length $ filter (not . B.null) $ B.split ' ' content
   print l
I get a time that's much more reasonable-looking, compared to the
original Prelude.words version:

    $ time ./a.out
    174313

    real    0m0.389s
    user    0m0.312s
    sys     0m0.004s

It seems like the B.splitWith isSpace code is really slow for some
reason. Anybody have any idea what's going on? The actual implementation
is using isSpaceWord8 which is a case statement looking for a pile of
different whitespace characters.

-- 
  .~.    Dino Morelli
  /V\    email: dino@ui3.info
/( )\   irc: dino-
^^-^^   preferred distro: Debian GNU/Linux  http://www.debian.org