
I noticed something about ByteStream performance that I don't understand. I have a test text document: $ ls -sh test-text-file 956K test-text-file Running this program, using the Prelude's IO functions:
module Main where
main = do content <- readFile "test-text-file" let l = length . words $ content print l
I get: $ time ./a.out 174372 real 0m0.805s user 0m0.720s sys 0m0.008s Running a version of the same thing using Data.ByteStream.Char8:
module Main where
import qualified Data.ByteString.Char8 as B
main = do content <- B.readFile "test-text-file" let l = length . B.words $ content print l
I see a time that is quite a bit slower: $ time ./a.out 174372 real 0m1.864s user 0m1.596s sys 0m0.012s Changing it to incorporate similar code to the implementation of B.words:
module Main where
import qualified Data.ByteString.Char8 as B import Data.Char (isSpace)
main = do content <- B.readFile "test-text-file" let l = length $ filter (not . B.null) $ B.splitWith isSpace content print l
I see a similar time as with B.words: $ time ./a.out 174372 real 0m1.835s user 0m1.628s sys 0m0.012s And then if I change this to use B.split ' ' instead of isSpace:
module Main where
import qualified Data.ByteString.Char8 as B
main = do content <- B.readFile "test-text-file" let l = length $ filter (not . B.null) $ B.split ' ' content print l
I get a time that's much more reasonable-looking, compared to the original Prelude.words version: $ time ./a.out 174313 real 0m0.389s user 0m0.312s sys 0m0.004s It seems like the B.splitWith isSpace code is really slow for some reason. Anybody have any idea what's going on? The actual implementation is using isSpaceWord8 which is a case statement looking for a pile of different whitespace characters. -- .~. Dino Morelli /V\ email: dino@ui3.info /( )\ irc: dino- ^^-^^ preferred distro: Debian GNU/Linux http://www.debian.org