debugging why we end up calling the wrapper rather than the worker

Hia all, I'm trying to figure out why this piece of code does not optimise the way I expect. It's binary serialisation again. Yes, again. The crux is the write function write :: Int -> (Ptr Word8 -> IO ()) -> Put () write !n body = Put $ \c buf@(Buffer fp o u l) -> if n <= l then write' c fp o u l else write' (flushOld c n fp o u) (newBuffer c n) 0 0 0 where {-# NOINLINE write' #-} write' c !fp !o !u !l = -- warning: this is a tad hardcore B.inlinePerformIO (withForeignPtr fp (\p -> body $! (p `plusPtr` (o+u)))) `seq` c () (Buffer fp o (u+n) (l-n)) In the if condition, the first path is the fast path. We do want this write' join point, we want to make a call to write' at runtime rather than inlining it into both the fast and slow paths. This is why we tag it with NOINLINE. Simon kindly recently fixed a bug where this NOINLINE was not remembered in the .hi files, so it now really does not inline write' into both branches. However, looking at the core/stg we can see that we're always calling the write' wrapper function that takes boxed arguments and calls the wrapper. It seems to me that the calls in both branches ought to be calls directly to the worker rather than the wrapper. We do have all the unboxed arguments available but they get boxed up to make the call to the write' wrapper. This isn't going to do good things for performance. Here's the STG code: So the important question is whether we have unboxed versions of all the args available. First of all lets look at the wrapper and see what args it unboxes when it calls the worker ($wwrite'_s17v in this case). write'_s17W = sat-only \r [w1_s17P w2_s17C w3_s17G w4_s17J w5_s17M] case w2_s17C of w6_s18S { GHC.ForeignPtr.ForeignPtr ww6_s17Q ww7_s17R -> case w3_s17G of w7_s18T { GHC.Base.I# ww8_s17S -> case w4_s17J of w8_s18U { GHC.Base.I# ww9_s17T -> case w5_s17M of w9_s18V { GHC.Base.I# ww10_s17U -> $wwrite'_s17v w1_s17P ww6_s17Q ww7_s17R ww8_s17S ww9_s17T ww10_s17U; }; }; }; }; So out of the args [w1_s17P w2_s17C w3_s17G w4_s17J w5_s17M] all but the first (which is a continuation function) get unboxed and the bits passed on to $wwrite'_s17v. So we should expect that if we have all those components available that we could make a direct call to the worker $wwrite'_s17v rather than going via the wrapper write'_s17W. So lets look at the STG code that we get from this bit of source: if n <= l then write' c fp o u l else write' (flushOld c n fp o u) (newBuffer c n) 0 0 0 (The STG code here is from a use when n was 3): We want to see if all the components that the worker $wwrite'_s17v needs are available: case <=# [3 ww5_s17X] of wild1_s18W { GHC.Base.False -> case Put.newBuffer w_s17Z n1_r16x of sat_s18d { __DEFAULT -> let { sat_s189 = NO_CCS GHC.Base.I#! [ww4_s187]; } in let { sat_s186 = NO_CCS GHC.Base.I#! [ww3_s184]; } in let { sat_s183 = NO_CCS GHC.ForeignPtr.ForeignPtr! [ww1_s180 ww2_s181]; } in let { sat_s18b = \u [] Put.flushOld w_s17Z n1_r16x sat_s183 sat_s186 sat_s189; } in write'_s17W sat_s18b sat_s18d lvl_r16z lvl_r16z lvl_r16z; }; GHC.Base.True -> let { sat_s18l = NO_CCS GHC.Base.I#! [ww5_s17X]; } in let { sat_s18j = NO_CCS GHC.Base.I#! [ww4_s187]; } in let { sat_s18h = NO_CCS GHC.Base.I#! [ww3_s184]; } in let { sat_s18f = NO_CCS GHC.ForeignPtr.ForeignPtr! [ww1_s180 ww2_s181]; } in write'_s17W w_s17Z sat_s18f sat_s18h sat_s18j sat_s18l; }; In the False case it looks like there is an excuse for calling the wrapper, because we're calling the wrapper for both Put.flushOld and Put.newBuffer. For Put.flushOld it's just to construct the continuation which is a boxed arg anyway, so that's not stopping us calling the worker. The Put.newBuffer returns a ForeignPtr and the write' worker takes the components of the ForeignPtr unboxed. However we are doing a case analysis on the result of Put.newBuffer so we could easily extract the components and pass them on to the write' worker. In the True case there is no excuse at all as far as I can figure out. We are explicitly boxing up exactly the arguments that the wrapper unboxes. So we really should be able to call the worker directly with the appropriate unboxed values as args. So what is going on here? Does NOINLINE prevent calling the worker or something?
From a quick experiment it would appear so:
{-# OPTIONS_GHC -fbang-patterns #-} module Foo (foo, bar) where foo :: Int -> Int foo n = bar n bar :: Int -> Int bar !n = bar (n+1) As is, bar will not be inlined because it's recursive and looking at the STG code we see that foo makes a call to bar's worker: Foo.$wbar = \r [ww_sc8] case +# [ww_sc8 1] of sat_sca { __DEFAULT -> Foo.$wbar sat_sca; }; Foo.bar = \r [w_scd] case w_scd of w1_scl { GHC.Base.I# ww_scg -> Foo.$wbar ww_scg; }; Foo.foo = \r [eta_sck] Foo.bar eta_sck; however if we add in what you'd think is a redundant pragma {-# NOINLINE bar #-} and look at the stg code again: Foo.bar = \r [w_sc9] case w_sc9 of w1_sco { GHC.Base.I# ww_scc -> Foo.$wbar ww_scc; }; Foo.$wbar = \r [ww_scf] case +# [ww_scf 1] of sat_sch { __DEFAULT -> let { sat_scj = NO_CCS GHC.Base.I#! [sat_sch]; } in Foo.bar sat_scj; }; Foo.foo = \r [eta_scn] Foo.bar eta_scn; then we see that foo is now calling bar's wrapper, and what's worse, bar is calling it's wrapper in the recursive call! Oh noes! So it seems to me that NOINLINE should prevent inlining but not prevent calling the worker rather than the wrapper. I don't fully understand how NOINLINE interacts with the worker/wrapper transform (or I wouldn't have been surprised by this behaviour). I'm guessing that it works by doing the worker/wrapper split and then trying to inline the wrapper into as many call sites as possible. If this is indeed how it works then it'd explain why attaching NOINLINE to the function causes the observed behaviour since looking at the .hi file we see that the NOINLINE is attached to the wrapper function and not the worker. So perhaps the solution is to attach the NOINLINE to the worker rather than the wrapper when doing the worker/wrapper split. Would that work or cause other problems? Seems otherwise I'm stuck. I thought I could use NOINLINE to control the creation of join points like in my original example. Duncan

| So it seems to me that NOINLINE should prevent inlining but not prevent | calling the worker rather than the wrapper. I don't fully understand how | NOINLINE interacts with the worker/wrapper transform (or I wouldn't have | been surprised by this behaviour). I'm guessing that it works by doing | the worker/wrapper split and then trying to inline the wrapper into as | many call sites as possible. If this is indeed how it works then it'd | explain why attaching NOINLINE to the function causes the observed | behaviour since looking at the .hi file we see that the NOINLINE is | attached to the wrapper function and not the worker. Exactly. And indeed, it doesn't really make sense to do a w/w split on a NOINLINE thing, if this is what happens. | So perhaps the solution is to attach the NOINLINE to the worker rather | than the wrapper when doing the worker/wrapper split. Would that work or | cause other problems? That is easily changed. But consider that you may have put that NOININE pragma there to stop the thing inlining so that a RULE would fire. We presumably do not want to uncritically make a NOINLINE thing into an INLINE thing, just because it's strict; that would nullify the carefully set pragmas to make sure the rules worked. I suggest you say NOINLINE [0]; that prevents inlining until the final phases of compilation, which is probably what you want. See if that works. Meanwhile, I should probably do no w/w for NOINLINE things. I'll postpone until this thread settles. Simon

On Mon, 2007-06-04 at 12:44 +0100, Simon Peyton-Jones wrote:
| So it seems to me that NOINLINE should prevent inlining but not prevent | calling the worker rather than the wrapper. I don't fully understand how | NOINLINE interacts with the worker/wrapper transform (or I wouldn't have | been surprised by this behaviour). I'm guessing that it works by doing | the worker/wrapper split and then trying to inline the wrapper into as | many call sites as possible. If this is indeed how it works then it'd | explain why attaching NOINLINE to the function causes the observed | behaviour since looking at the .hi file we see that the NOINLINE is | attached to the wrapper function and not the worker.
Exactly. And indeed, it doesn't really make sense to do a w/w split on a NOINLINE thing, if this is what happens.
| So perhaps the solution is to attach the NOINLINE to the worker rather | than the wrapper when doing the worker/wrapper split. Would that work or | cause other problems?
That is easily changed. But consider that you may have put that NOININE pragma there to stop the thing inlining so that a RULE would fire. We presumably do not want to uncritically make a NOINLINE thing into an INLINE thing, just because it's strict; that would nullify the carefully set pragmas to make sure the rules worked.
Ah, yes of course.
I suggest you say NOINLINE [0]; that prevents inlining until the final phases of compilation, which is probably what you want. See if that works.
But that allows it to be inlined in phase 0, and that's exactly what I don't want. I really do not want this function inlined, I want it to be a join point. So in this example, I'm not trying to do rule matching, I just want to create a join point. Duncan

| But that allows it to be inlined in phase 0, and that's exactly what I | don't want. I really do not want this function inlined, I want it to be | a join point. Remind me why you really don't want it inlined, ever? Even if it's small etc. S

On Mon, 2007-06-04 at 14:01 +0100, Simon Peyton-Jones wrote:
| But that allows it to be inlined in phase 0, and that's exactly what I | don't want. I really do not want this function inlined, I want it to be | a join point.
Remind me why you really don't want it inlined, ever? Even if it's small etc.
I'm partitioning a fast path and a slow path. I don't want an extra copy of the code hanging around just because of the slow path. So I want both the fast and slow paths to share a single copy of the object code for doing the writing to memory. Then the fast path is just a jump to this code, and the slow path calls a bunch of other out of line functions to fix things up before jumping. So there is no advantage to inlining here, except call overhead, but that should be low too. In fact if it's not impossible to imaging that the jump to the function could be combined with the conditional test & jump, rather than it being a conditional test & jump followed by an unconditional jump in the fast path. The code was: write :: Int -> (Ptr Word8 -> IO ()) -> Put () write !n body = Put $ \c buf@(Buffer fp o u l) -> if n <= l then write' c fp o u l --fast path else write' (flushOld c n fp o u) (newBuffer c n) 0 0 0 where {- NOINLINE write' -} write' c !fp !o !u !l = -- warning: this is a tad hardcore B.inlinePerformIO (withForeignPtr fp (\p -> body $! (p `plusPtr` (o+u)))) `seq` c () (Buffer fp o (u+n) (l-n)) where 'body' is an IO function that writes half a dozen bytes into a memory block. Duncan

| Meanwhile, I should probably do no w/w for NOINLINE things. I've now done this in the HEAD: no worker/wrapper split for NOINLINE things. Yell if anything breaks, but I don't see that anything bad should happen. Simon
participants (2)
-
Duncan Coutts
-
Simon Peyton-Jones