Inliner behaviour - tiny changes lead to huge performance differences

I'm working on measuring and improving the performance of the text library at the moment, and the very first test I tried demonstrated a piece of behaviour that I'm not completely able to understand. Actually, I'm not able to understand what's going on at all, beyond a very shallow level. All the comments below pertain to GHC 6.10.4. The text library uses stream fusion, and I want to measure the performance of UTF-8 decoding. The code I'm measuring is very simple: import qualified Data.ByteString as B import Data.Text.Encoding as T import qualified Data.Text as T import System.Environment (getArgs) import Control.Monad (forM_) main = do args <- getArgs forM_ args $ \a -> do s <- B.readFile a let t = T.decodeUtf8 s print (T.length t) The streamUtf8 function looks roughly like this: streamUtf8 :: OnDecodeError -> ByteString -> Stream Char streamUtf8 onErr bs = Stream next 0 (maxSize l) where l = B.length bs next i | i >= l = Done | U8.validate1 x1 = Yield (unsafeChr8 x1) (i+1) | {- etc. -} {-# INLINE [0] streamUtf8 #-} The values being Yielded from the inner function are, as you can see, themselves constructed by functions. Originally, with the inner next function manually marked as INLINE, I found that functions like unsafeChr8 were not being inlined by GHC, and performance was terrible due to the amount of boxing and unboxing happening in the inner loop. I somehow stumbled on the idea of removing the INLINE annotation from next, and performance suddenly improved by a significant integer multiple. This caused the body of streamUtf8 to be inlined into my test program, as I hoped. However, I wasn't yet out of the woods. The length function is defined as follows: length :: Text -> Int length t = Stream.length (Stream.stream t) {-# INLINE length #-} And the streaming length is: length :: Stream Char -> Int length = S.lengthI {-# INLINE[1] length #-} And the lengthI function is defined more generally, in the hope that I could use it for both Int and Int64 lengths: lengthI :: Integral a => Stream Char -> a lengthI (Stream next s0 _len) = loop_length 0 s0 where loop_length !z s = case next s of Done -> z Skip s' -> loop_length z s' Yield _ s' -> loop_length (z + 1) s' {-# INLINE[0] lengthI #-} Unfortunately, although lengthI is inlined into the Int-typed streaming length function, that function is not in turn marked with __inline_me in simplifier output, so the length/decodeUtf8 loops do not fuse. The code is pretty fast, but there's still a lot of boxing and unboxing happening for all the Yields. So. I am quite baffled by this, and I confess to having no idea what to do to get the remaining functions to fuse. But that's not quite confusing enough! Here's a one-byte change to my test code: main = do args <- getArgs forM_ args $ \a -> do s <- B.readFile a let !t = decodeUtf8 s *{- <-- notice the strictness annotation -}* print (T.length t) In principle, this should make the code a little slower, because I'm deliberately forcing a Text value to be created, instead of allowing stream/unstream fusion to occur. Now the length function seems to get inlined properly, but while the decodeUtf8 function is inlined, the functions in its inner loop that must be inlined for performance purposes are not. The result is very slow code. I found another site for this one test where removing a single INLINEannotation makes the strictified code above 2x faster, but that change causes the stream/unstream fusion rule to fail to fire entirely, so the strictness annotation no longer makes a difference to performance. All of these flip-flops in inliner behaviour are very difficult to understand, and they seem to be exceedingly fragile. Should I expect the situation to be better with the new inliner in 6.12? Thanks for bearing with that rather long narrative, Bryan.

On 13/11/2009, at 18:04, Bryan O'Sullivan wrote:
main = do args <- getArgs forM_ args $ \a -> do s <- B.readFile a let t = T.decodeUtf8 s print (T.length t)
The streamUtf8 function looks roughly like this:
streamUtf8 :: OnDecodeError -> ByteString -> Stream Char streamUtf8 onErr bs = Stream next 0 (maxSize l) where l = B.length bs next i | i >= l = Done | U8.validate1 x1 = Yield (unsafeChr8 x1) (i+1) | {- etc. -} {-# INLINE [0] streamUtf8 #-}
The values being Yielded from the inner function are, as you can see, themselves constructed by functions.
Originally, with the inner next function manually marked as INLINE, I found that functions like unsafeChr8 were not being inlined by GHC, and performance was terrible due to the amount of boxing and unboxing happening in the inner loop.
Let's see if I understand this correctly. In your code, decodeUtf8 calls streamUtf8. They both get inlined into main but then unsafeChr8 does not. Correct? If so, are you sure that unsafeChr8 is really called in the simplified code? IIUC, this isn't necessary if you don't actually inspect the Chars (which length presumably doesn't). So perhaps GHC removes the call altogether? If not, what does it do with the result?
I somehow stumbled on the idea of removing the INLINE annotation from next, and performance suddenly improved by a significant integer multiple. This caused the body of streamUtf8 to be inlined into my test program, as I hoped.
Or are you saying that it's streamUtf8 that isn't getting inlined into main?
length :: Text -> Int length t = Stream.length (Stream.stream t) {-# INLINE length #-}
And the streaming length is:
length :: Stream Char -> Int length = S.lengthI {-# INLINE[1] length #-}
And the lengthI function is defined more generally, in the hope that I could use it for both Int and Int64 lengths:
lengthI :: Integral a => Stream Char -> a lengthI (Stream next s0 _len) = loop_length 0 s0 where loop_length !z s = case next s of Done -> z Skip s' -> loop_length z s' Yield _ s' -> loop_length (z + 1) s' {-# INLINE[0] lengthI #-}
Unfortunately, although lengthI is inlined into the Int-typed streaming length function, that function is not in turn marked with __inline_me in simplifier output, so the length/decodeUtf8 loops do not fuse. The code is pretty fast, but there's still a lot of boxing and unboxing happening for all the Yields.
Does changing the definition of length to length = id S.lengthI help? GHC used to have a bug in this area but I haven't been bitten by it for quite some time. Also, I wonder how Stream.stream is defined. Is it strict in Text? If it isn't, does making it strict help?
All of these flip-flops in inliner behaviour are very difficult to understand, and they seem to be exceedingly fragile. Should I expect the situation to be better with the new inliner in 6.12?
I suspect that the fragility you are seeing is just a symptom of a problem in how the UTF-8 library implements stream fusion. It's a bit tricky to get everything right. Generally, I've found the simplifier to be quite stable and predictable in the last year or so. Simon is working hard on making it even better. If you have a spare minute, perhaps you could try the HEAD with the new inliner and see if that helps? Although I somewhat doubt it, to be honest. Roman

Bryan | I'm working on measuring and improving the performance of the text library at the | moment, and the very first test I tried demonstrated a piece of behaviour | that I'm not completely able to understand. Actually, I'm not able to | understand what's going on at all, beyond a very shallow level. | All the comments below pertain to GHC 6.10.4. My goal is for INLINE pragmas to be very predictable. I can't decode your message enough to offer any insights; thank you Roman, who is closer to it, for helping. As Roman says, I committed a patch two days ago which constitutes a fairly radical overhaul of the way inlining works, strongly motivated by wanting predictability. So can you try the HEAD? If that doesn't work, let's make it concrete. You identify several cases where something unexpected happened. Can you submit a Trac ticket with instructions for reproducing this unexpected behaviour? Just to make it self contained, maybe include any non-standard libraries as an attachment? Simon

Hi All. I think this is a not new question, but probably a missed it. On my MacOS X 10.5 (32 bit) I use GHC 6.10.4 (installed by Mac package GHC-6.10.4-i386.pkg) to build some halkell programs and all is fine. Yesterday I've updated the system to MacOS X 10.6 (64 bit) and now when I try to build the same programs I get the errors: /var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1167:0: 32-bit absolute addressing is not supported for x86-64 /var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1167:0: cannot do signed 4 byte relocation /var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1170:0: 32-bit absolute addressing is not supported for x86-64 /var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1170:0: cannot do signed 4 byte relocation Is there any solution to continue to use the installed GHC on the new 64 bit system? Thanks in advance for any answer. Luca

Hi, Change your /usr/bin/ghc to #!/bin/sh exec /Library/Frameworks/GHC.framework/Versions/610/usr/lib/ghc-6.10.4/ghc -optc-m32 -opta-m32 -optl-m32 -B/Library/Frameworks/GHC.framework/Versions/610/usr/lib/ghc-6.10.4/. -dynload wrapped ${1+"$@"} The options -optc-m32 -opta-m32 -optl-m32 must be added. Christophe.
Hi All. I think this is a not new question, but probably a missed it.
On my MacOS X 10.5 (32 bit) I use GHC 6.10.4 (installed by Mac package GHC-6.10.4-i386.pkg) to build some halkell programs and all is fine. Yesterday I've updated the system to MacOS X 10.6 (64 bit) and now when I try to build the same programs I get the errors:
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1167:0: 32-bit absolute addressing is not supported for x86-64
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1167:0: cannot do signed 4 byte relocation
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1170:0: 32-bit absolute addressing is not supported for x86-64
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1170:0: cannot do signed 4 byte relocation
Is there any solution to continue to use the installed GHC on the new 64 bit system?
Thanks in advance for any answer.
Luca _______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Thanks a lot Christophe. This solve my problem. Luca. On Nov 15, 2009, at 1:49 PM, alpheccar wrote:
Hi,
Change your /usr/bin/ghc to
#!/bin/sh
exec /Library/Frameworks/GHC.framework/Versions/610/usr/lib/ghc-6.10.4/ghc -optc-m32 -opta-m32 -optl-m32 -B/Library/Frameworks/GHC.framework/Versions/610/usr/lib/ghc-6.10.4/. -dynload wrapped ${1+"$@"}
The options -optc-m32 -opta-m32 -optl-m32 must be added.
Christophe.
Hi All. I think this is a not new question, but probably a missed it.
On my MacOS X 10.5 (32 bit) I use GHC 6.10.4 (installed by Mac package GHC-6.10.4-i386.pkg) to build some halkell programs and all is fine. Yesterday I've updated the system to MacOS X 10.6 (64 bit) and now when I try to build the same programs I get the errors:
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1167:0: 32-bit absolute addressing is not supported for x86-64
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1167:0: cannot do signed 4 byte relocation
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1170:0: 32-bit absolute addressing is not supported for x86-64
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1170:0: cannot do signed 4 byte relocation
Is there any solution to continue to use the installed GHC on the new 64 bit system?
Thanks in advance for any answer.
Luca _______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Also /usr/bin/ghci, /usr/bin/runghc, /usr/bin/runhaskell to patch up all the holes. There may be others, also. Here is a reference to the original thread where someone found out SL broke GHC and then worked through temporarily resolving it: http://old.nabble.com/Snow-Leopard-Breaks-GHC-td25198347.html -Ross On Nov 15, 2009, at 8:03 AM, Luca Ciciriello wrote:
Thanks a lot Christophe. This solve my problem.
Luca.
On Nov 15, 2009, at 1:49 PM, alpheccar wrote:
Hi,
Change your /usr/bin/ghc to
#!/bin/sh
exec /Library/Frameworks/GHC.framework/Versions/610/usr/lib/ghc-6.10.4/ghc -optc-m32 -opta-m32 -optl-m32 -B/Library/Frameworks/GHC.framework/Versions/610/usr/lib/ghc-6.10.4/. -dynload wrapped ${1+"$@"}
The options -optc-m32 -opta-m32 -optl-m32 must be added.
Christophe.
Hi All. I think this is a not new question, but probably a missed it.
On my MacOS X 10.5 (32 bit) I use GHC 6.10.4 (installed by Mac package GHC-6.10.4-i386.pkg) to build some halkell programs and all is fine. Yesterday I've updated the system to MacOS X 10.6 (64 bit) and now when I try to build the same programs I get the errors:
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1167:0: 32-bit absolute addressing is not supported for x86-64
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1167:0: cannot do signed 4 byte relocation
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1170:0: 32-bit absolute addressing is not supported for x86-64
/var/folders/vr/vrW2wwvtFKScalkhVEWujE+++TI/-Tmp-/ghc1613_0/ghc1613_0.s:1170:0: cannot do signed 4 byte relocation
Is there any solution to continue to use the installed GHC on the new 64 bit system?
Thanks in advance for any answer.
Luca _______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
_______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

On Fri, Nov 13, 2009 at 12:26 AM, Simon Peyton-Jones
My goal is for INLINE pragmas to be very predictable. I can't decode your message enough to offer any insights; thank you Roman, who is closer to it, for helping.
Things are considerably different with HEAD than with 6.10.4. HEAD is indeed spotting and exploiting many of the opportunities for inlining, while 6.10.4 is a bit of a morass. The difference is stark: my test program runs in 0.7 seconds with HEAD, and 1.2 with 6.10.4. Here's a rough table of my results: 6.10.4 8.39 seconds HEAD 0.50 HEAD* 0.50 6.10.4* 0.39 6.10.4** 0.34 The asterisk above denotes the removal of a single INLINE pragma from the text library. The doubled asterisk denotes the removal of a piece of indirection: instead of length defined as lengthI and both marked as INLINE, I manually inlined lengthI into the body of length. For your amusement, GNU "wc -m" takes 1.1 seconds to count the number of Unicode characters in the same file, so I think that our combination of performance and brevity is *wonderful*. Thanks! So HEAD is far better than 6.10.4 (yay!), but a little tweaking of the library code makes the 6.10.4 code faster again (boo!). The HEAD inliner seems, as you hoped, to be behaving far more predictably than its predecessor. If you'd like to investigate the remaining performance discrepancy between 6.10.4 and HEAD, I'll create a Trac ticket with instructions on how to reproduce my numbers. In the time between now and the release of 6.14, I wonder what to do. I'm building 6.12 to see how it fares, but my experience with 6.10 so far suggests that the behaviour of the 6.12 inliner will be fragile and difficult to understand, which is a bit of a shame. On that older code base, it seems that I can get really good fused performance, or okay unfused performance, but not both.

Bryan
It’s good news that the HEAD is better.
To be honest I’m not terribly enthusiastic about trying to nail down exactly what’s happening in 6.10 and 6.12 because, although they are indeed the compilers people will be using, it’s otherwise wasted work because the HEAD is so different.
Can you try with 6.12 and see if you can find a recipe that does well enough? If you get desperate (ie there’s a huge perf bump that you can’t eliminate) then I’ll certainly try to help.
Meanwhile, I don’t know why 6.10 is faster than HEAD (by 25% too) and I’d like to understand that. Can you submit a Trac ticket saying how to reproduce? You might need to bundle up the library too, to make sure we can reproduce it precisely.
Thanks
Simon
From: Bryan O'Sullivan [mailto:bos@serpentine.com]
Sent: 17 November 2009 07:14
To: Simon Peyton-Jones
Cc: glasgow-haskell-users@haskell.org
Subject: Re: Inliner behaviour - tiny changes lead to huge performance differences
On Fri, Nov 13, 2009 at 12:26 AM, Simon Peyton-Jones

On Tue, Nov 17, 2009 at 12:24 AM, Simon Peyton-Jones
To be honest I’m not terribly enthusiastic about trying to nail down exactly what’s happening in 6.10 and 6.12 because, although they are indeed the compilers people will be using, it’s otherwise wasted work because the HEAD is so different.
I don't blame you! That's completely reasonable.
Can you try with 6.12 and see if you can find a recipe that does well enough? If you get desperate (ie there’s a huge perf bump that you can’t eliminate) then I’ll certainly try to help.
Will do, thanks.
Meanwhile, I don’t know why 6.10 is faster than HEAD (by 25% too) and I’d like to understand that. Can you submit a Trac ticket saying how to reproduce? You might need to bundle up the library too, to make sure we can reproduce it precisely.
Certainly. The test program is tiny, but because of all the inlining in the library, the simplifier output is pretty fearsome.

On Fri, Nov 13, 2009 at 12:19 AM, Roman Leshchinskiy
streamUtf8. They both get inlined into main but then unsafeChr8 does not. Correct?
Here's what I see in the simplifer output with 6.10.4: the *unoptimised*body of streamUtf8 is being inlined into main, with many out-of-line functions called in its inner loop, then length is out-of-line applied to that result. I somehow stumbled on the idea of removing the INLINE annotation from next,
and performance suddenly improved by a significant integer multiple. This caused the body of streamUtf8 to be inlined into my test program, as I hoped.
Or are you saying that it's streamUtf8 that isn't getting inlined into main?
When I trimmed that INLINE out, the body of streamUtf8 was being inlined, but differently: all of the functions it had been calling out-of-line were now inlined.
Does changing the definition of length to
length = id S.lengthI
help? GHC used to have a bug in this area but I haven't been bitten by it for quite some time.
That change makes no real difference. It changes the function called at that call site, but it's still out-of-line.
Also, I wonder how Stream.stream is defined. Is it strict in Text? If it isn't, does making it strict help?
It is strict in Text, yes. If you have a spare minute, perhaps you could try the HEAD with the new
inliner and see if that helps? Although I somewhat doubt it, to be honest.
I posted those numbers in a reply to Simon a little while ago. HEAD is generally much better than 6.10, which is great, but I'm still stuck with this mystery on versions of the compiler that people may actually be able to use :-\

On 13/11/2009 07:04, Bryan O'Sullivan wrote:
I'm working on measuring and improving the performance of the text library at the moment, and the very first test I tried demonstrated a piece of behaviour that I'm not completely able to understand. Actually, I'm not able to understand what's going on at all, beyond a very shallow level. All the comments below pertain to GHC 6.10.4.
The text library uses stream fusion, and I want to measure the performance of UTF-8 decoding.
The code I'm measuring is very simple:
import qualified Data.ByteString as B import Data.Text.Encoding as T import qualified Data.Text as T import System.Environment (getArgs) import Control.Monad (forM_)
main = do args <- getArgs forM_ args $ \a -> do s <- B.readFile a let t = T.decodeUtf8 s print (T.length t)
The streamUtf8 function looks roughly like this:
streamUtf8 :: OnDecodeError -> ByteString -> Stream Char streamUtf8 onErr bs = Stream next 0 (maxSize l) where l = B.length bs next i | i >= l = Done | U8.validate1 x1 = Yield (unsafeChr8 x1) (i+1) | {- etc. -} {-# INLINE [0] streamUtf8 #-}
The values being Yielded from the inner function are, as you can see, themselves constructed by functions.
Originally, with the inner next function manually marked as INLINE, I found that functions like unsafeChr8 were not being inlined by GHC, and performance was terrible due to the amount of boxing and unboxing happening in the inner loop.
I somehow stumbled on the idea of removing the INLINE annotation from next, and performance suddenly improved by a significant integer multiple. This caused the body of streamUtf8 to be inlined into my test program, as I hoped.
I think I can explain this one, at least partially. When you mark a function INLINE, GHC does not optimise the body of the function itself, on the grounds that it will be inlined at the call site anyway and get optimised there. Simon just changed this behaviour in GHC 6.12, so that GHC now keeps the original definition for inlining, but also optimises the original function as normal, which is useful if we can't or don't want to inline at a call site for some reason (or indeed if the calling module is being compiled without -O!). What we still don't understand, however, is why streamUtf8 was not being inlined at the call site. That is the root of the problem. We'll need to look more closely at the call site to understand what's going on. Cheers, Simon (PS if any of what I said contradicts what Simon and/or Roman said, then please ignore me and not them :-)

Hello Simon, Friday, November 13, 2009, 4:43:34 PM, you wrote:
I think I can explain this one, at least partially. When you mark a function INLINE, GHC does not optimise the body of the function itself, on the grounds that it will be inlined at the call site anyway and get optimised there.
yes! i've seen this problem in 6.4. in particular, if function to inline had a cycle (i.e. self-recursion), it wasn't optimized nor inlined -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Fri, Nov 13, 2009 at 2:04 AM, Bryan O'Sullivan
And the lengthI function is defined more generally, in the hope that I could use it for both Int and Int64 lengths:
lengthI :: Integral a => Stream Char -> a lengthI (Stream next s0 _len) = loop_length 0 s0 where loop_length !z s = case next s of Done -> z Skip s' -> loop_length z s' Yield _ s' -> loop_length (z + 1) s' {-# INLINE[0] lengthI #-}
Would it help to SPECIALIZE lengthI for Int and Int64?
--
Dave Menendez
participants (9)
-
alpheccar
-
Bryan O'Sullivan
-
Bulat Ziganshin
-
David Menendez
-
Luca Ciciriello
-
Roman Leshchinskiy
-
Ross Mellgren
-
Simon Marlow
-
Simon Peyton-Jones