
firefly:
What do you think the relative speeds are of the six small haskell programs at the end of this email?
All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function.
Say the fastest one takes the time 1. How much time will the others take?
And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)?
I'm going to post full benchmarks + analysis on Wednesday.
-Peter
*) hardddisk megabytes. The file is 150000034 bytes ≈ 143 mebibytes.
PS: For extra credit, what do you think is the peak memory use for this program when given an input file of 150MB?
Well, I'm not going to wait till Wednesday for the numbers! $ ghc --version The Glorious Glasgow Haskell Compilation System, version 6.8.2 $ du -hs 150M 150M ------------------------------------------------------------------------ Program 1: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B import GHC.Int (Int64) -- note that D.BS.Lazy.Char8.length is ByteString -> Int64 -- D.BS.C8.length is ByteString -> Int cnt :: B.ByteString -> Int64 cnt bs = B.length (B.filter (== ' ') bs) main = do s <- B.getContents print (cnt s) Ok, so lazy bytestrings. Should be constant space use, but two traversals (since there's no lazy bytestring fusion, over each lazy chunk). Not perfect, but should be ok. length will consume chunks as they're produced by filter. ** Prediction: Constant ~3M space use (runtime, plus small overhead) Fast, due to chunk-wise processing. ** Result: $ ghc -O2 A.hs -o A --make $ time ./A < 150M +RTS -sstderr ./A +RTS -sstderr < 150M 1.01s user 0.10s system 98% cpu 1.123 total And top says 40M allocated. ** Summary: Ok, pretty fast, but an unexpected(!) amount of memory allocated. Now, this memory result is suspicious, I wonder if the now obsolete 'array fusion' is messing things up. In Data.ByteString.Lazy, we have: filter p = F.loopArr . F.loopL (F.filterEFL p) F.NoAcc We keep meaning to replace this stuff with the stream fusion mechanisms, which compile a lot better. Perhaps the time has come to look at that :) I'll put this memory allocation down as a bug that needs to be looked at. ------------------------------------------------------------------------ Program 2: hs/space-bs-c8-acc-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: Int -> B.ByteString -> Int cnt !acc bs = if B.null bs then acc else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs) main = do s <- B.getContents print (cnt 0 s) Strict bytestrings, and you manually fuse the length/filter calculation. Allocating all that memory will cost you. ** Prediction: O(N) memory, around 150M allocated Slower, due to cache effects (more data to traverse) and more indirections. ** Result top says 154M ./B +RTS -sstderr < 150M 1.10s user 0.52s system 111% cpu 1.454 total ** Summary: Seems reasonable, when its doing all that allocation. ------------------------------------------------------------------------ Program 2a: But we could easily improve this program: Since: length (filter (== ' ') == length (filterByte ' ' == count ' ' And we have: import qualified Data.ByteString.Char8 as B cnt :: B.ByteString -> Int cnt x = B.count ' ' x main = do s <- B.getContents print (cnt s) ** Prediction: Which should be a bit faster. ** Result: $ time ./B < 150M 24569024 ./B < 150M 0.66s user 0.55s system 113% cpu 1.070 total ** Summary: So that's the fastest program so far. The rewrite rules to do these transformatoins used to be enabled, but need looking at again. There should also be no real benefit to manually fuse a length . filter loop like this, however, the old fusion system used in bytestring might have some small overhead here. This also needs looking at. ------------------------------------------------------------------------ Program 2b: We can do even better if we read the file in: import qualified Data.ByteString.Char8 as B import System.IO.Posix.MMap main = print . B.count ' ' =<< mmapFile "150M" ** Prediction: super fast ** Result: $ time ./B2 24569024 ./B2 0.31s user 0.01s system 101% cpu 0.314 total (similar results if you use vanilla B.readFile too, fwiw). Summary: This suggests to me we could look again at how strings of unknown size are read in. ------------------------------------------------------------------------ Program 3: hs/space-bslc8-acc-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: Int -> B.ByteString -> Int cnt !acc bs = if B.null bs then acc else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs) main = do s <- B.getContents print (cnt 0 s) Ok, for lazy bytestrings. ** Prediction: Should run in constant space, but there are more checks than in the unfused length/filter case. It should be a bit slower, and hopefully run in constant space. ** Result: $ time ./C < 150M 24569024 ./C < 150M 2.36s user 0.11s system 99% cpu 2.489 total top says 3804K ** Summary: So it runs in the 3M constant space I'd expected the original program to run in, but its a fair bit slower. The generated code looks pretty good though. Investigate why this is slower. ------------------------------------------------------------------------ Program 4: ============================== hs/space-xxxxx-acc-1.hs: {-# LANGUAGE BangPatterns #-} cnt :: Int -> String -> Int cnt !acc bs = if null bs then acc else cnt (if head bs == ' ' then acc+1 else acc) (tail bs) main = do s <- getContents print (cnt 0 s) ** Prediction: Lazy, so constant space, but probably 10x slower than the previous program. ** I'm not going to bother with this one, since strings suck for large data. ------------------------------------------------------------------------ Program 5: hs/space-bs-c8-foldlx-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: B.ByteString -> Int cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- B.getContents print (cnt s) Ok, a strict foldl'. ** Prediction: Should be similar to the strict bytestring first example, and a little faster due to no redundant null checks. ** Result $ time ./D < 150M 24569024 ./D < 150M 1.02s user 0.58s system 111% cpu 1.436 tota ** Summary: As expected ------------------------------------------------------------------------ Program 6: ============================== hs/space-bslc8-foldlx-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: B.ByteString -> Int cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- B.getContents print (cnt s) A strict fold over a lazy bytestring. Prediction: Ok, constant memory use, and should be similar in speed to the first length. filter program. It'll be faster than manual accumulating lazy bytestring program, since the redundant null checks are gone from the inner loop. Result: $ time ./E < 150M 24569024 ./E < 150M 0.84s user 0.07s system 98% cpu 0.928 total 2468k memory. Wow. Didn't expect it to be that fast. So properly lazy, and nicely fast. This is what we'd hope to see :) It also re-enforces that there's a bug in the length . filter program. ------------------------------------------------------------------------ Program 7: ============================== hs/space-xxxxx-foldl.hs: {-# LANGUAGE BangPatterns #-} cnt :: String -> Int cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- getContents print (cnt s) Hmm. Lazy accumulator eh, on String? Should exhibit a space leak. Not worth running.. ********************** Summary, * Program 1 is fast, as expected, but exhbits a bug in the bytestring library's lazy bytestring fusion system. Something in length or filter isn't doing the right job. This code will be replaced by the stream fusion system soon. * Program 2: as expected. strict IO uses O(N) space, and that has performance effects. * Program 3: lazy bytestrings use constant space, but you better avoid redundant bounds checks in the inner loops. * Program 4: strings are silly * Program 5: as expected. similar to program 2. * Program 6: strict foldl's over lazy bytestrings are good :) fast, and constant space. * Program 7: see program 4. Pretty much as expected then, but with a bug identified in lazy bytestring fusion (I think). Nice little benchmark. -- Don