
On Sat, 2011-10-29 at 14:18 +0200, Bas van Dijk wrote:
> Any idea why "toStrict/simple #1" is faster than "toStrict/optimized #1"?
To be honest, I missed that #1 was actually slower... thanks for pointing
that out! The reason is simply that B.concat performs zero-copying for the
0- and 1-chunk cases:

    -- | /O(n)/ Concatenate a list of ByteStrings.
    concat :: [ByteString] -> ByteString
    concat []   = empty
    concat [ps] = ps
    concat xs   = ...

I've added those zero-copy optimizations to the optimized toStrict2, and now
all timings are better than for the naive implementation toStrict1:

--------------------------------------------------------------------------
{-# LANGUAGE OverloadedStrings #-}

import Criterion
import Criterion.Main

import qualified Data.ByteString               as B
import qualified Data.ByteString.Internal      as BI
import qualified Data.ByteString.Lazy          as BL
import qualified Data.ByteString.Lazy.Internal as BLI

import Foreign.ForeignPtr
import Foreign.Ptr

toStrict1 :: BL.ByteString -> B.ByteString
toStrict1 = B.concat . BL.toChunks

toStrict2 :: BL.ByteString -> B.ByteString
toStrict2 BLI.Empty               = B.empty
toStrict2 (BLI.Chunk c BLI.Empty) = c
toStrict2 lb                      = BI.unsafeCreate len $ go lb
  where
    len = BLI.foldlChunks (\l sb -> l + B.length sb) 0 lb

    go BLI.Empty                    _   = return ()
    go (BLI.Chunk (BI.PS fp s l) r) ptr =
        withForeignPtr fp $ \p -> do
          BI.memcpy ptr (p `plusPtr` s) (fromIntegral l)
          go r (ptr `plusPtr` l)

main :: IO ()
main = do
    let lbs0 = ""
        lbs1 = "abcdefghij"
        lbs2 = BL.fromChunks (replicate 2    "abcdefghij")
        lbs3 = BL.fromChunks (replicate 10   "abcdefghij")
        lbs4 = BL.fromChunks (replicate 1000 "abcdefghij")

    print $ toStrict1 lbs0 == toStrict2 lbs0
    print $ toStrict1 lbs1 == toStrict2 lbs1
    print $ toStrict1 lbs2 == toStrict2 lbs2
    print $ toStrict1 lbs3 == toStrict2 lbs3
    print $ toStrict1 lbs4 == toStrict2 lbs4

    defaultMain
      [ bgroup "toStrict"
        [ bench "simple #0"    $ whnf toStrict1 lbs0
        , bench "simple #1"    $ whnf toStrict1 lbs1
        , bench "simple #2"    $ whnf toStrict1 lbs2
        , bench "simple #3"    $ whnf toStrict1 lbs3
        , bench "simple #4"    $ whnf toStrict1 lbs4
        , bench "optimized #0" $ whnf toStrict2 lbs0
        , bench "optimized #1" $ whnf toStrict2 lbs1
        , bench "optimized #2" $ whnf toStrict2 lbs2
        , bench "optimized #3" $ whnf toStrict2 lbs3
        , bench "optimized #4" $ whnf toStrict2 lbs4
        ]
      ]

{-
True
True
True
True
True
warming up
estimating clock resolution...
mean is 2.537877 us (320001 iterations)
found 2658 outliers among 319999 samples (0.8%)
  2292 (0.7%) high severe
estimating cost of a clock call...
mean is 55.38578 ns (15 iterations)

benchmarking toStrict/simple #0
mean: 17.66316 ns, lb 17.62767 ns, ub 17.74077 ns, ci 0.950
std dev: 258.0479 ps, lb 146.2057 ps, ub 417.6999 ps, ci 0.950

benchmarking toStrict/simple #1
mean: 28.59188 ns, lb 28.49765 ns, ub 28.70322 ns, ci 0.950
std dev: 523.0005 ps, lb 415.2360 ps, ub 688.4154 ps, ci 0.950

benchmarking toStrict/simple #2
mean: 144.5600 ns, lb 144.2192 ns, ub 145.2525 ns, ci 0.950
std dev: 2.397734 ns, lb 1.352302 ns, ub 3.905039 ns, ci 0.950

benchmarking toStrict/simple #3
mean: 488.0094 ns, lb 486.6532 ns, ub 490.3121 ns, ci 0.950
std dev: 8.903734 ns, lb 6.046690 ns, ub 13.68234 ns, ci 0.950

benchmarking toStrict/simple #4
mean: 55.65404 us, lb 55.43386 us, ub 55.97341 us, ci 0.950
std dev: 1.342695 us, lb 989.9903 ns, ub 1.836054 us, ci 0.950

benchmarking toStrict/optimized #0
mean: 14.14306 ns, lb 14.10655 ns, ub 14.20415 ns, ci 0.950
std dev: 237.0362 ps, lb 159.7752 ps, ub 347.7329 ps, ci 0.950

benchmarking toStrict/optimized #1
mean: 19.10087 ns, lb 19.05273 ns, ub 19.18831 ns, ci 0.950
std dev: 322.9545 ps, lb 201.9111 ps, ub 503.2965 ps, ci 0.950

benchmarking toStrict/optimized #2
mean: 63.34285 ns, lb 63.21118 ns, ub 63.59386 ns, ci 0.950
std dev: 903.1160 ps, lb 543.0056 ps, ub 1.377267 ns, ci 0.950

benchmarking toStrict/optimized #3
mean: 166.5292 ns, lb 166.2405 ns, ub 167.1715 ns, ci 0.950
std dev: 2.144647 ns, lb 1.259903 ns, ub 3.408946 ns, ci 0.950

benchmarking toStrict/optimized #4
mean: 13.05338 us, lb 13.02102 us, ub 13.11728 us, ci 0.950
std dev: 222.8512 ns, lb 116.9145 ns, ub 343.6604 ns, ci 0.950
-}
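Incidentally, the zero-copy behaviour of the 1-chunk case can be observed
directly, not just through timings: B.concat [ps] = ps means the result
shares the input's buffer, which BI.toForeignPtr makes visible. A small
standalone sketch (the literals and pointer comparison are mine, not from
the benchmark above):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString          as B
import qualified Data.ByteString.Internal as BI

main :: IO ()
main = do
  let sb = "abcdefghij" :: B.ByteString
      -- toForeignPtr exposes the buffer a ByteString points into
      (fp1, _, _) = BI.toForeignPtr sb
      (fp2, _, _) = BI.toForeignPtr (B.concat [sb])      -- one chunk
      (fp3, _, _) = BI.toForeignPtr (B.concat [sb, sb])  -- two chunks
  print (fp1 == fp2)  -- True: the single-chunk case reuses the buffer
  print (fp1 == fp3)  -- False: concatenating two chunks allocates afresh
```

The same pointer comparison would show that toStrict2 is zero-copy on
single-chunk lazy ByteStrings, while toStrict1 was only as good as whatever
B.concat happened to do.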