Inliner behaviour - tiny changes lead to huge performance differences

13 Nov 2009

      I'm working on measuring and improving the performance of the text library
at the moment, and the very first test I tried demonstrated a piece of
behaviour that I'm not completely able to understand. Actually, I'm not able
to understand what's going on at all, beyond a very shallow level. All the
comments below pertain to GHC 6.10.4.

The text library uses stream fusion, and I want to measure the performance
of UTF-8 decoding.

The code I'm measuring is very simple:

import qualified Data.ByteString as B
import Data.Text.Encoding as T
import qualified Data.Text as T
import System.Environment (getArgs)
import Control.Monad (forM_)

main = do
  args <- getArgs
  forM_ args $ \a -> do
    s <- B.readFile a
    let t = T.decodeUtf8 s
    print (T.length t)

The streamUtf8 function looks roughly like this:

streamUtf8 :: OnDecodeError -> ByteString -> Stream Char
streamUtf8 onErr bs = Stream next 0 (maxSize l)
    where
      l = B.length bs
      next i
          | i >= l =          Done
          | U8.validate1 x1 = Yield (unsafeChr8 x1) (i+1)
          | {- etc. -}
{-# INLINE [0] streamUtf8 #-}

The values being Yielded from the inner function are, as you can see,
themselves constructed by functions.

Originally, with the inner next function manually marked as INLINE, I found
that functions like unsafeChr8 were not being inlined by GHC, and
performance was terrible due to the amount of boxing and unboxing happening
in the inner loop.

I somehow stumbled on the idea of removing the INLINE annotation from next,
and performance suddenly improved by a significant integer multiple. This
caused the body of streamUtf8 to be inlined into my test program, as I
hoped.

However, I wasn't yet out of the woods. The length function is defined as
follows:

length :: Text -> Int
length t = Stream.length (Stream.stream t)
{-# INLINE length #-}

And the streaming length is:

length :: Stream Char -> Int
length = S.lengthI
{-# INLINE[1] length #-}

And the lengthI function is defined more generally, in the hope that I could
use it for both Int and Int64 lengths:

lengthI :: Integral a => Stream Char -> a
lengthI (Stream next s0 _len) = loop_length 0 s0
    where
      loop_length !z s  = case next s of
                           Done       -> z
                           Skip    s' -> loop_length z s'
                           Yield _ s' -> loop_length (z + 1) s'
{-# INLINE[0] lengthI #-}

Unfortunately, although lengthI is inlined into the Int-typed streaming
length function, that function is not in turn marked with __inline_me in
simplifier output, so the length/decodeUtf8 loops do not fuse. The code is
pretty fast, but there's still a lot of boxing and unboxing happening for
all the Yields.

So. I am quite baffled by this, and I confess to having no idea what to do
to get the remaining functions to fuse. But that's not quite confusing
enough! Here's a one-byte change to my test code:

main = do
  args <- getArgs
  forM_ args $ \a -> do
    s <- B.readFile a
    let !t = decodeUtf8 s *{- <-- notice the strictness annotation -}*
    print (T.length t)

In principle, this should make the code a little slower, because I'm
deliberately forcing a Text value to be created, instead of allowing
stream/unstream fusion to occur. Now the length function seems to get
inlined properly, but while the decodeUtf8 function is inlined, the
functions in its inner loop that must be inlined for performance purposes
are not. The result is very slow code.

I found another site for this one test where removing a single
INLINEannotation makes the strictified code above 2x faster, but that
change
causes the stream/unstream fusion rule to fail to fire entirely, so the
strictness annotation no longer makes a difference to performance.

All of these flip-flops in inliner behaviour are very difficult to
understand, and they seem to be exceedingly fragile. Should I expect the
situation to be better with the new inliner in 6.12?

Thanks for bearing with that rather long narrative,
Bryan.

Inliner behaviour - tiny changes lead to huge performance differences

Bryan O'Sullivan