
On 13/11/2009, at 18:04, Bryan O'Sullivan wrote:
main = do args <- getArgs forM_ args $ \a -> do s <- B.readFile a let t = T.decodeUtf8 s print (T.length t)
The streamUtf8 function looks roughly like this:
streamUtf8 :: OnDecodeError -> ByteString -> Stream Char streamUtf8 onErr bs = Stream next 0 (maxSize l) where l = B.length bs next i | i >= l = Done | U8.validate1 x1 = Yield (unsafeChr8 x1) (i+1) | {- etc. -} {-# INLINE [0] streamUtf8 #-}
The values being Yielded from the inner function are, as you can see, themselves constructed by functions.
Originally, with the inner next function manually marked as INLINE, I found that functions like unsafeChr8 were not being inlined by GHC, and performance was terrible due to the amount of boxing and unboxing happening in the inner loop.
Let's see if I understand this correctly. In your code, decodeUtf8 calls streamUtf8. They both get inlined into main but then unsafeChr8 does not. Correct? If so, are you sure that unsafeChr8 is really called in the simplified code? IIUC, this isn't necessary if you don't actually inspect the Chars (which length presumably doesn't). So perhaps GHC removes the call altogether? If not, what does it do with the result?
I somehow stumbled on the idea of removing the INLINE annotation from next, and performance suddenly improved by a significant integer multiple. This caused the body of streamUtf8 to be inlined into my test program, as I hoped.
Or are you saying that it's streamUtf8 that isn't getting inlined into main?
length :: Text -> Int length t = Stream.length (Stream.stream t) {-# INLINE length #-}
And the streaming length is:
length :: Stream Char -> Int length = S.lengthI {-# INLINE[1] length #-}
And the lengthI function is defined more generally, in the hope that I could use it for both Int and Int64 lengths:
lengthI :: Integral a => Stream Char -> a lengthI (Stream next s0 _len) = loop_length 0 s0 where loop_length !z s = case next s of Done -> z Skip s' -> loop_length z s' Yield _ s' -> loop_length (z + 1) s' {-# INLINE[0] lengthI #-}
Unfortunately, although lengthI is inlined into the Int-typed streaming length function, that function is not in turn marked with __inline_me in simplifier output, so the length/decodeUtf8 loops do not fuse. The code is pretty fast, but there's still a lot of boxing and unboxing happening for all the Yields.
Does changing the definition of length to length = id S.lengthI help? GHC used to have a bug in this area but I haven't been bitten by it for quite some time. Also, I wonder how Stream.stream is defined. Is it strict in Text? If it isn't, does making it strict help?
All of these flip-flops in inliner behaviour are very difficult to understand, and they seem to be exceedingly fragile. Should I expect the situation to be better with the new inliner in 6.12?
I suspect that the fragility you are seeing is just a symptom of a problem in how the UTF-8 library implements stream fusion. It's a bit tricky to get everything right. Generally, I've found the simplifier to be quite stable and predictable in the last year or so. Simon is working hard on making it even better. If you have a spare minute, perhaps you could try the HEAD with the new inliner and see if that helps? Although I somewhat doubt it, to be honest. Roman