efficient Bytestring mapM_ for IO/ST?

Is there an efficient way to iterate over the bytes of a ByteString? For a benchmark, I am just computing the sum of the bytes in an STRef. This is a bit silly, but makes for simple test code. (Accumulating a sum in a boxed STRef is several times slower than computing a full histogram in an STUArray, and foldl' (\x b -> x + fromIntegral b) 0 is faster still). The skeleton of the test program is import qualified Data.ByteString.Lazy as L import Control.Monad.ST import Control.Monad import Data.STRef import Data.Word mapMBS_ST : (Word8 -> ST s ()) -> L.ByteString -> ST s () mapMBS_ST = ??? main = do bytes <- L.readFile "input" print $ runST (do v <- newSTRef (0 :: Word64) mapMBS_ST (\b -> do x <- readSTRef v writeSTRef v $! x + fromIntegral b) bytes readSTRef v) The file "input" is 1GB of random bytes (from /dev/urandom). A version which uses the internal representation and explicit peeks runs several times faster than the best implementation I could write without using the unsafe interface. The safe implementations I tried were sum3 : mapMBS_ST f bytes = L.foldr (\b r -> f b >> r) (return ()) bytes sum4 : mapMBS_ST f bytes = if L.null bytes then return () else f (L.head bytes)
mapMBS_ST f (L.tail bytes) sum5 : mapMBS_ST f bytes = mapM_ f (L.unpack bytes)
sum used an unsafe implementation, and sum2 simply uses foldl'. Here are times. Runs vary by a few hundredths of a second user time. ./sum 8.62s user 0.39s system 99% cpu 9.006 total ./sum2 1.26s user 0.34s system 99% cpu 1.604 total ./sum3 72.25s user 0.50s system 99% cpu 1:12.96 total ./sum4 20.55s user 0.66s system 99% cpu 21.251 total ./sum5 391.18s user 1.23s system 99% cpu 6:32.76 total Comparing sum and sum2 gives some idea of the overhead that comes from the boxed STRef, so the relative overhead of the different iteration routines is probably even higher. Looking at the ByteString code, I suspect the foldr version performs badly because foldr uses one unsafePerformIO over a loop which builds the value in an argument, and GHC doesn't see that pointer accesses in the loop could be lifted out of the unsafePerformIO and interleaved with the IO actions built from the argument. (I don't expect GHC to be too smart about taking advantage of unsafePerformIO). Perhaps foldr could be modified so the foldr version above will optimze nicely. Are the benchmarks in http://darcs.haskell.org/bytestring/ good for avoiding performance regressions? That repository has recent changes, but the page Is http://www.cse.unsw.edu.au/~dons/fps.html which is supposedly the bytestring homepage seems a little out of date. Brandon Code for the unsafe map follows: module BytestringMaps (mapMS_IO, mapMBS_IO, mapMS_ST, mapMBS_ST) where import qualified Data.ByteString.Lazy as L import Data.ByteString as S import Data.ByteString.Unsafe as BU import Data.Word import Control.Monad import Data.Array.Storable import Foreign.Storable import Foreign.Ptr import Control.Monad.ST for l u m = go l where go ix | ix >= u = return () | otherwise = m ix >> go (ix+1) {-# INLINE mapMS_IO #-} mapMS_IO :: (Word8 -> IO ()) -> S.ByteString -> IO () mapMS_IO fun chunk = unsafeUseAsCStringLen chunk (\(ptr,len) -> let ptr' = castPtr ptr :: Ptr Word8 in for 0 len (\ix -> do c <- peekElemOff ptr' ix fun c)) {-# INLINE mapMBS_IO #-} mapMBS_IO :: (Word8 -> IO ()) -> L.ByteString -> IO () mapMBS_IO fun str = mapM_ (mapMS_IO fun) (L.toChunks str) {-# INLINE mapMS_ST #-} mapMS_ST :: (Word8 -> ST s ()) -> S.ByteString -> ST s () mapMS_ST fun chunk = unsafeIOToST $ unsafeUseAsCStringLen chunk (\(ptr,len) -> let ptr' = castPtr ptr :: Ptr Word8 in for 0 len (\ix -> do c <- peekElemOff ptr' ix unsafeSTToIO (fun c))) {-# INLINE mapMBS_ST #-} mapMBS_ST :: (Word8 -> ST s ()) -> L.ByteString -> ST s () mapMBS_ST fun str = mapM_ (mapMS_ST fun) (L.toChunks str)

On 3/21/11 4:40 PM, Brandon Moore wrote:
Is there an efficient way to iterate over the bytes of a ByteString?
The code I've been using (rather similar to your unsafe map) is: import qualified Data.ByteString.Internal as BSI import qualified Foreign.ForeignPtr as FFI foldIO :: (a -> Word8 -> IO a) -> a -> ByteString -> IO a foldIO f z0 (BSI.PS fp off len) = FFI.withForeignPtr fp $ \p0 -> do let q = p0 `plusPtr` (off+len) let go z p | z `seq` p `seq` False = undefined | p == q = return z | otherwise = do w <- peek p z' <- f z w go z' (p `plusPtr` 1) go z0 (p0 `plusPtr` off) {-# INLINE foldIO #-} Some things to note: * It's a left fold rather than a right fold, just like foldM, except that we can't generalize it to work for all monads. (We could do a right fold just as easily by starting with p0`plusPtr`(off+len) and counting down to p0`plusPtr`off if desired.) * Because we're just keeping the head pointer, we can increment it as we go instead of using peekElemOff. This improves the performance by only performing one addition per loop (the p++) instead of two (ix++ and *(p+ix)), and by requiring one less register (for ix). * The inline pragma helps performance in a major way. * I haven't actually looked at Core nor tried much to optimize it. This just seems like the easiest way to allow the accumulator to perform IO instead of being pure. (For pure code there's foldl'.) -- Live well, ~wren

From: wren ng thornton
Sent: Mon, March 21, 2011 10:30:48 PM On 3/21/11 4:40 PM, Brandon Moore wrote: Is there an efficient way to iterate over the bytes of a ByteString?
The code I've been using (rather similar to your unsafe map) is:
import qualified Data.ByteString.Internal as BSI import qualified Foreign.ForeignPtr as FFI
foldIO :: (a -> Word8 -> IO a) -> a -> ByteString -> IO a foldIO f z0 (BSI.PS fp off len) = FFI.withForeignPtr fp $ \p0 -> do let q = p0 `plusPtr` (off+len) let go z p | z `seq` p `seq` False = undefined | p == q = return z | otherwise = do w <- peek p z' <- f z w go z' (p `plusPtr` 1) go z0 (p0 `plusPtr` off) {-# INLINE foldIO #-}
I don't need to pass an acumulating parameter. I'll see how well my code runs if I do. With a modified version of foldr, the simple definition mapM_ f x = foldr (\b rest -> f b >> rest) (return ()) is a bit faster than the specialized code from before. The changes to foldr improved performance of this definition by almost 10x for my one benchmark. Here's the strict ByteString code: {-# INLINE foldr'' #-} foldr'' :: (Word8 -> a -> a) -> a -> ByteString -> a foldr'' k v (PS x s l) = inlinePerformIO $ withForeignPtr x $ \ptr -> let { !u = s + l ; go ix | ix == u = v | otherwise = k (inlinePerformIO (do c <- peekElemOff ptr ix; touchForeignPtr x; return c)) (go (ix+1)) } in return (go s) The lazy ByteString version is built with foldrChunks foldr f v = foldrChunks (\chunk rest -> foldr'' f rest chunk) Looking for performance regressions, this seems to be around 10% slower than the current foldr in the test main = do str <- readFile "1GiBFile" print $ length $ foldr (:) [] str Surprisingly, Data.ByteString.Lazy.foldr (:) [] seems to be about twice as fast as unpack! Are there any other benchmarks I should try? Are there any other uses for a lazy ByteString foldr? The test suite from http://darcs.haskell.org/bytestring built after a bit of hacking on the Makefile, but doesn't seem to do much timing. Brandon
participants (2)
-
Brandon Moore
-
wren ng thornton