
I've had a look at how some of the code was being compiled for strict and
lazy bytestrings, and also at which rules weren't firing. With some small
tweaks the code seems back in good shape. An updated bytestring library
is at:

    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-0.9.0....

Enjoy! :)

------------------------------------------------------------------------

Summary: the suspicious lazy bytestring program works now.
(constant space, and fastest overall, as expected originally)

Program 1: lazy bytestring, length . filter

    Yesterday:
        ./A +RTS -sstderr < 150M
        1.01s user 0.10s system 98% cpu 1.123 total
        40M allocated

    Today (fixed!):
        ./A +RTS -sstderr < 150M
        0.26s user 0.06s system 96% cpu 0.332 total
        2M allocated

Reason: the deprecated array fusion was mucking up the optimiser. I think
we can close this regression.
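For reference, Program 1 itself isn't reproduced in this message.
Presumably it is a one-liner of roughly this shape; the (== ' ')
predicate is a guess, chosen to match the space-counting loop in
Program 3 below:

    -- A sketch of what Program 1 presumably looks like: count spaces
    -- with length . filter over a lazy bytestring read from stdin.
    import qualified Data.ByteString.Lazy.Char8 as L

    main :: IO ()
    main = print . L.length . L.filter (== ' ') =<< L.getContents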
------------------------------------------------------------------------

Also, I had a look at Program 3: lazy bytestring, custom loop.

Unchanged: 2.4s, constant space. This was a bit slow. Further
investigation shows lots of unnecessary bounds checks, as we take apart
the Chunk lazy bytestring type, then test and continue. This
representation was chosen to make it possible to process whole chunks
efficiently, so that we can avoid these bounds checks. Something like
this instead:

    {-# LANGUAGE BangPatterns #-}

    import qualified Data.ByteString               as S
    import qualified Data.ByteString.Internal      as S (w2c)
    import qualified Data.ByteString.Unsafe        as S (unsafeHead, unsafeTail)
    import qualified Data.ByteString.Lazy          as B
    import qualified Data.ByteString.Lazy.Internal as B (ByteString(..))

    cnt :: Int -> B.ByteString -> Int
    cnt n B.Empty        = n
    cnt n (B.Chunk x xs) = cnt (n + cnt_strict 0 x) xs  -- process lazy spine
        -- now we can process a chunk without checking for Empty
        where
            cnt_strict !i !s                            -- then strict chunk
                | S.null s  = i
                | c == ' '  = cnt_strict (i+1) t
                | otherwise = cnt_strict i t
                where
                    (c,t) = (S.w2c (S.unsafeHead s), S.unsafeTail s)  -- no bounds check

    main :: IO ()
    main = do s <- B.getContents; print (cnt 0 s)

This lets us avoid redundant checks for Empty, while allowing
'cnt_strict' to avoid unnecessary checks for the empty strict
bytestring. It is some 4x faster. This alternating between lazy spines
and strict chunk processing is the best way to get reliable performance
from lazy bytestring custom loops.
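The spine walk can be abstracted so you don't write the recursion each
time. A hypothetical sketch of the idea (the helper name 'foldChunks' is
made up here, not taken from the library):

    {-# LANGUAGE BangPatterns #-}

    import qualified Data.ByteString               as S
    import qualified Data.ByteString.Lazy.Internal as B

    -- Hypothetical helper: strictly fold a per-chunk function over the
    -- lazy spine, so the per-chunk code never has to test for Empty.
    foldChunks :: (a -> S.ByteString -> a) -> a -> B.ByteString -> a
    foldChunks f = go
        where
            go !a B.Empty        = a
            go !a (B.Chunk c cs) = go (f a c) cs

With cnt_strict lifted to the top level, the counting loop above is then
just 'foldChunks cnt_strict 0'.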
-- Don