Data.ByteStream.Char8.words performance

I noticed something about ByteStream performance that I don't understand. I have a test text document: $ ls -sh test-text-file 956K test-text-file Running this program, using the Prelude's IO functions:
module Main where
main = do content <- readFile "test-text-file" let l = length . words $ content print l
I get: $ time ./a.out 174372 real 0m0.805s user 0m0.720s sys 0m0.008s Running a version of the same thing using Data.ByteStream.Char8:
module Main where
import qualified Data.ByteString.Char8 as B
main = do content <- B.readFile "test-text-file" let l = length . B.words $ content print l
I see a time that is quite a bit slower: $ time ./a.out 174372 real 0m1.864s user 0m1.596s sys 0m0.012s Changing it to incorporate similar code to the implementation of B.words:
module Main where
import qualified Data.ByteString.Char8 as B import Data.Char (isSpace)
main = do content <- B.readFile "test-text-file" let l = length $ filter (not . B.null) $ B.splitWith isSpace content print l
I see a similar time as with B.words: $ time ./a.out 174372 real 0m1.835s user 0m1.628s sys 0m0.012s And then if I change this to use B.split ' ' instead of isSpace:
module Main where
import qualified Data.ByteString.Char8 as B
main = do content <- B.readFile "test-text-file" let l = length $ filter (not . B.null) $ B.split ' ' content print l
I get a time that's much more reasonable-looking, compared to the original Prelude.words version: $ time ./a.out 174313 real 0m0.389s user 0m0.312s sys 0m0.004s It seems like the B.splitWith isSpace code is really slow for some reason. Anybody have any idea what's going on? The actual implementation is using isSpaceWord8 which is a case statement looking for a pile of different whitespace characters. -- .~. Dino Morelli /V\ email: dino@ui3.info /( )\ irc: dino- ^^-^^ preferred distro: Debian GNU/Linux http://www.debian.org

Hello, Did you compile with -O2 ? That makes a huge difference when using ByteString. j.

On Fri, 30 Mar 2007, Jeremy Shaw wrote:
Hello,
Did you compile with -O2 ? That makes a huge difference when using ByteString.
j.
Ah, that was exactly it. I feel silly.
module Main where
import qualified Data.ByteString.Char8 as B
main = do content <- B.readFile "test-text-file" let l = length . B.words $ content print l
$ ghc -O2 count-b.hs $ time ./a.out 174372 real 0m0.198s user 0m0.136s sys 0m0.012s Much faster. Thank you! dino-

On Fri, 2007-03-30 at 14:24 -0700, Jeremy Shaw wrote:
Hello,
Did you compile with -O2 ? That makes a huge difference when using ByteString.
Hmm, I think we can do better than that. It would be nicer to have it work fast without needing any -O flags at all in the user's module. Lets look at the current def again: words :: ByteString -> [ByteString] words = P.filter (not . B.null) . B.splitWith isSpaceWord8 {-# INLINE words #-} So this will always inline words into your program (when using -O or -O2) however there is nothing really to be gained from doing that. There's no fusion going on here, it's always going to (lazily) allocate the result list. So I think it's probably better to just remove the inline pragma. In fact Dino's original program might work faster with -O0 than -O1. :-) The best you could do with the current definition (rather than writing a specialised implementation) is something like: words = P.filter (not . B.null) . words' {-# INLINE words #-} words' = B.splitWith isSpaceWord8 {-# NOINLINE words' #-} since the filter could fuse in the calling context with a good list consumer but the B.splitWith is not a good producer in it's current definition so there is no benefit to inlining it. All that gives you is the potential to compile it badly in the calling module rather than just calling the single compiled version in the ByteString lib (that was of course built with -O2). The ByteString libs was more-or-less the first high performance thing that we wrote and we've learnt plenty more since then. I think there's a good deal more performance too eek out of it yet, both at the low and high level. Duncan
participants (3)
-
Dino Morelli
-
Duncan Coutts
-
Jeremy Shaw