I'm working on measuring and improving the performance of the text library at the moment, and the very first test I tried demonstrated a piece of behaviour that I'm not completely able to understand. Actually, I'm not able to understand what's going on at all, beyond a very shallow level. All the comments below pertain to GHC 6.10.4.

The text library uses stream fusion, and I want to measure the performance of UTF-8 decoding.

The code I'm measuring is very simple:

import qualified Data.ByteString as B
import Data.Text.Encoding as T

import qualified Data.Text as T
import System.Environment (getArgs)

import Control.Monad (forM_)

main = do
  args <- getArgs

  forM_ args $ \a -> do
   s <- B.readFile a

   let t = T.decodeUtf8 s
   print (T.length t)

The streamUtf8 function looks roughly like this:

streamUtf8 :: OnDecodeError -> ByteString -> Stream Char
streamUtf8 onErr bs = Stream next 0 (maxSize l)

   where
   l = B.length bs
   next i
   | i >= l = Done
   | U8.validate1 x1 = Yield (unsafeChr8 x1) (i+1)

   | {- etc. -}

{-# INLINE [0] streamUtf8 #-}

The values being Yielded from the inner function are, as you can see, themselves constructed by functions.

Originally, with the inner next function manually marked as INLINE, I found that functions like unsafeChr8 were not being inlined by GHC, and performance was terrible due to the amount of boxing and unboxing happening in the inner loop.

I somehow stumbled on the idea of removing the INLINE annotation from next, and performance suddenly improved by a significant integer multiple. This caused the body of streamUtf8 to be inlined into my test program, as I hoped.

However, I wasn't yet out of the woods. The length function is defined as follows:

length :: Text -> Int

length t = Stream.length (Stream.stream t)

{-# INLINE length #-}

And the streaming length is:

length :: Stream Char -> Int
length = S.lengthI
{-# INLINE[1] length #-}

And the lengthI function is defined more generally, in the hope that I could use it for both Int and Int64 lengths:

lengthI :: Integral a => Stream Char -> a

lengthI (Stream next s0 _len) = loop_length 0 s0

   where
   loop_length !z s = case next s of

   Done -> z
   Skip s' -> loop_length z s'

   Yield _ s' -> loop_length (z + 1) s'

{-# INLINE[0] lengthI #-}

Unfortunately, although lengthI is inlined into the Int-typed streaming length function, that function is not in turn marked with __inline_me in simplifier output, so the length/decodeUtf8 loops do not fuse. The code is pretty fast, but there's still a lot of boxing and unboxing happening for all the Yields.

So. I am quite baffled by this, and I confess to having no idea what to do to get the remaining functions to fuse. But that's not quite confusing enough! Here's a one-byte change to my test code:

main = do

  args <- getArgs

  forM_ args $ \a -> do

   s <- B.readFile a

   let !t = decodeUtf8 s {- <-- notice the strictness annotation -}

   print (T.length t)

In principle, this should make the code a little slower, because I'm deliberately forcing a Text value to be created, instead of allowing stream/unstream fusion to occur. Now the length function seems to get inlined properly, but while the decodeUtf8 function is inlined, the functions in its inner loop that must be inlined for performance purposes are not. The result is very slow code.

I found another site for this one test where removing a single INLINE annotation makes the strictified code above 2x faster, but that change causes the stream/unstream fusion rule to fail to fire entirely, so the strictness annotation no longer makes a difference to performance.

All of these flip-flops in inliner behaviour are very difficult to understand, and they seem to be exceedingly fragile. Should I expect the situation to be better with the new inliner in 6.12?

Thanks for bearing with that rather long narrative,

Bryan.