
Is there an efficient way to iterate over the bytes of a ByteString? For a benchmark, I am just computing the sum of the bytes in an STRef. This is a bit silly, but makes for simple test code. (Accumulating a sum in a boxed STRef is several times slower than computing a full histogram in an STUArray, and foldl' (\x b -> x + fromIntegral b) 0 is faster still). The skeleton of the test program is import qualified Data.ByteString.Lazy as L import Control.Monad.ST import Control.Monad import Data.STRef import Data.Word mapMBS_ST : (Word8 -> ST s ()) -> L.ByteString -> ST s () mapMBS_ST = ??? main = do bytes <- L.readFile "input" print $ runST (do v <- newSTRef (0 :: Word64) mapMBS_ST (\b -> do x <- readSTRef v writeSTRef v $! x + fromIntegral b) bytes readSTRef v) The file "input" is 1GB of random bytes (from /dev/urandom). A version which uses the internal representation and explicit peeks runs several times faster than the best implementation I could write without using the unsafe interface. The safe implementations I tried were sum3 : mapMBS_ST f bytes = L.foldr (\b r -> f b >> r) (return ()) bytes sum4 : mapMBS_ST f bytes = if L.null bytes then return () else f (L.head bytes)
mapMBS_ST f (L.tail bytes) sum5 : mapMBS_ST f bytes = mapM_ f (L.unpack bytes)
sum used an unsafe implementation, and sum2 simply uses foldl'. Here are times. Runs vary by a few hundredths of a second user time. ./sum 8.62s user 0.39s system 99% cpu 9.006 total ./sum2 1.26s user 0.34s system 99% cpu 1.604 total ./sum3 72.25s user 0.50s system 99% cpu 1:12.96 total ./sum4 20.55s user 0.66s system 99% cpu 21.251 total ./sum5 391.18s user 1.23s system 99% cpu 6:32.76 total Comparing sum and sum2 gives some idea of the overhead that comes from the boxed STRef, so the relative overhead of the different iteration routines is probably even higher. Looking at the ByteString code, I suspect the foldr version performs badly because foldr uses one unsafePerformIO over a loop which builds the value in an argument, and GHC doesn't see that pointer accesses in the loop could be lifted out of the unsafePerformIO and interleaved with the IO actions built from the argument. (I don't expect GHC to be too smart about taking advantage of unsafePerformIO). Perhaps foldr could be modified so the foldr version above will optimze nicely. Are the benchmarks in http://darcs.haskell.org/bytestring/ good for avoiding performance regressions? That repository has recent changes, but the page Is http://www.cse.unsw.edu.au/~dons/fps.html which is supposedly the bytestring homepage seems a little out of date. Brandon Code for the unsafe map follows: module BytestringMaps (mapMS_IO, mapMBS_IO, mapMS_ST, mapMBS_ST) where import qualified Data.ByteString.Lazy as L import Data.ByteString as S import Data.ByteString.Unsafe as BU import Data.Word import Control.Monad import Data.Array.Storable import Foreign.Storable import Foreign.Ptr import Control.Monad.ST for l u m = go l where go ix | ix >= u = return () | otherwise = m ix >> go (ix+1) {-# INLINE mapMS_IO #-} mapMS_IO :: (Word8 -> IO ()) -> S.ByteString -> IO () mapMS_IO fun chunk = unsafeUseAsCStringLen chunk (\(ptr,len) -> let ptr' = castPtr ptr :: Ptr Word8 in for 0 len (\ix -> do c <- peekElemOff ptr' ix fun c)) {-# INLINE mapMBS_IO #-} mapMBS_IO :: (Word8 -> IO ()) -> L.ByteString -> IO () mapMBS_IO fun str = mapM_ (mapMS_IO fun) (L.toChunks str) {-# INLINE mapMS_ST #-} mapMS_ST :: (Word8 -> ST s ()) -> S.ByteString -> ST s () mapMS_ST fun chunk = unsafeIOToST $ unsafeUseAsCStringLen chunk (\(ptr,len) -> let ptr' = castPtr ptr :: Ptr Word8 in for 0 len (\ix -> do c <- peekElemOff ptr' ix unsafeSTToIO (fun c))) {-# INLINE mapMBS_ST #-} mapMBS_ST :: (Word8 -> ST s ()) -> L.ByteString -> ST s () mapMBS_ST fun str = mapM_ (mapMS_ST fun) (L.toChunks str)