[Git][ghc/ghc][wip/T23109a] 3 commits: Improve the Simplifier

29 Apr 2025

Simon Peyton Jones pushed to branch wip/T23109a at Glasgow Haskell Compiler / GHC


Commits:
d4279916 by Simon Peyton Jones at 2025-04-29T11:25:18+01:00
Improve the Simplifier

While working on #23109, I made two improvements to the Simplifier

* I found that the Simplifier was sometimes iterating more than it should.
  I fixed this by improving postInlineUnconditionally.

* I refactored tryCastWorkerWrapper.  It is now clearer, and does less
  repeated work.  This allowed me to call it from makeTrivial, which again
  does a bit more in one pass, elminating a potential extra Simplifier
  iteration

More care in postInline

Don't inline data con apps so vigorously

needs more docs

Wibbles

Be more careful in mkDupableContWithDmds

to not create a binding that will immediately be inlined

Do post-inline used-once bindings

This makes cacheprof not regress and seems generally a good plan

More eperiments

* Don't inline toplevel things so much
* Don't float constants so vigorously in the first float-out

Comments only

Refator GHC.Core.Opt.SetLevels.notWorthFloating

I refactored `notWorthFloating` while I was doing something else.

I don't think there's a change in behaviour, but if so it's very much
a corner case.

Always float bottoming expressions to the top

...regardless of floatConsts

Comments only

Wibble SetLevels

Try getting rid of this early-phase business

Don't float PAPs to top level

...and treat case alternatives as strict contexts

Wibble to postInlineUnconditionally

Small wibbles

Don't make error calls interesting.
Literals say True too isSaturatedConApp

Import wibble

Tiny change to saves_alloc

Float lambdas (and PAPs) out of lambdas to top level
This improves spectral/cse

But the old comment was
        -- is_con_app: don't float PAPs to the top; they may well end
        -- up getting eta-expanded and re-inlined
        -- E.g.   f = \x -> (++) ys
        -- If we float, then eta-expand we get
        --      lvl = (++) ys
        --      f = \x \zs -> lvl zs
        -- and now we'll inline lvl.  Silly.

Let's see what CI says

- - - - -
2d087046 by Simon Peyton Jones at 2025-04-29T11:25:28+01:00
Specialise the (higher order) showSignedFloat

- - - - -
680f1b60 by Simon Peyton Jones at 2025-04-29T11:25:28+01:00
Eta reduce augment and its rules

... to match foldr.  I found this reduced some simplifer iterations

Fix `augment`!

- - - - -


10 changed files:

- compiler/GHC/Core/Opt/OccurAnal.hs
- compiler/GHC/Core/Opt/Pipeline.hs
- compiler/GHC/Core/Opt/SetLevels.hs
- compiler/GHC/Core/Opt/Simplify/Env.hs
- compiler/GHC/Core/Opt/Simplify/Iteration.hs
- compiler/GHC/Core/Opt/Simplify/Utils.hs
- compiler/GHC/Core/Utils.hs
- libraries/ghc-internal/src/GHC/Internal/Base.hs
- libraries/ghc-internal/src/GHC/Internal/Float.hs
- testsuite/tests/simplCore/should_run/simplrun009.hs


Changes:

=====================================
compiler/GHC/Core/Opt/OccurAnal.hs
=====================================
@@ -55,7 +55,7 @@ import GHC.Types.Tickish
 import GHC.Types.Var.Set
 import GHC.Types.Var.Env
 import GHC.Types.Var
-import GHC.Types.Demand ( argOneShots, argsOneShots, isDeadEndSig )
+import GHC.Types.Demand ( argOneShots, argsOneShots {- , isDeadEndSig -} )
 
 import GHC.Utils.Outputable
 import GHC.Utils.Panic
@@ -1096,14 +1096,14 @@ mkNonRecRhsCtxt lvl bndr unf
     certainly_inline -- See Note [Cascading inlines]
       = -- mkNonRecRhsCtxt is only used for non-join points, so occAnalBind
         -- has set the OccInfo for this binder before calling occAnalNonRecRhs
+        -- Distressing delicacy ... has to line up with preInlineUnconditionally
         case idOccInfo bndr of
           OneOcc { occ_in_lam = NotInsideLam, occ_n_br = 1 }
-            -> active && not stable_unf && not top_bottoming
+            -> active && not (isTopLevel lvl) && not stable_unf
           _ -> False
 
     active     = isAlwaysActive (idInlineActivation bndr)
     stable_unf = isStableUnfolding unf
-    top_bottoming = isTopLevel lvl && isDeadEndId bndr
 
 -----------------
 occAnalRecBind :: OccEnv -> TopLevelFlag -> ImpRuleEdges -> [(Var,CoreExpr)]
@@ -2580,8 +2580,9 @@ occAnalArgs !env fun args !one_shots
 
     -- Make bottoming functions interesting
     -- See Note [Bottoming function calls]
-    encl | Var f <- fun, isDeadEndSig (idDmdSig f) = OccScrut
-         | otherwise                               = OccVanilla
+--    encl | Var f <- fun, isDeadEndSig (idDmdSig f) = OccScrut
+--         | otherwise                               = OccVanilla
+    encl = OccVanilla
 
     go uds fun [] _ = WUD uds fun
     go uds fun (arg:args) one_shots
@@ -2606,7 +2607,8 @@ Consider
    let x = (a,b) in
    case p of
       A -> ...(error x)..
-      B -> ...(ertor x)...
+      B -> ...(error x)...
+      C -> ..blah...
 
 postInlineUnconditionally may duplicate x's binding, but sometimes it
 does so only if the use site IsInteresting.  Pushing allocation into error
@@ -2616,6 +2618,9 @@ setting occ_encl = OccScrut for such calls.
 The slightly-artificial test T21128 is a good example.  It's probably
 not a huge deal.
 
+ToDo!!!  Fix comment.   Now postinlineUnconditionally ignores intersting-ness for
+non-top-level things.
+
 Note [Arguments of let-bound constructors]
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Consider


=====================================
compiler/GHC/Core/Opt/Pipeline.hs
=====================================
@@ -217,7 +217,7 @@ getCoreToDo dflags hpt_rule_base extra_vars
         if full_laziness then
            CoreDoFloatOutwards $ FloatOutSwitches
                 { floatOutLambdas     = Just 0
-                , floatOutConstants   = True
+                , floatOutConstants   = False  -- Initially
                 , floatOutOverSatApps = False
                 , floatToTopLevelOnly = False
                 , floatJoinsToTop     = False  -- Initially, don't float join points at all
@@ -284,7 +284,7 @@ getCoreToDo dflags hpt_rule_base extra_vars
         --        f_el22 (f_el21 r_midblock)
         runWhen full_laziness $ CoreDoFloatOutwards $ FloatOutSwitches
                { floatOutLambdas     = floatLamArgs dflags
-               , floatOutConstants   = True
+               , floatOutConstants   = True  -- For SpecConstr and CSE
                , floatOutOverSatApps = True
                , floatToTopLevelOnly = False
                , floatJoinsToTop     = True },


=====================================
compiler/GHC/Core/Opt/SetLevels.hs
=====================================
@@ -406,7 +406,7 @@ lvlApp env orig_expr ((_,AnnVar fn), args)
   , arity < n_val_args
   , Nothing <- isClassOpId_maybe fn
   =  do { rargs' <- mapM (lvlNonTailMFE env False) rargs
-        ; lapp'  <- lvlNonTailMFE env False lapp
+        ; lapp'  <- lvlNonTailMFE env True lapp
         ; return (foldl' App lapp' rargs') }
 
   | otherwise
@@ -482,14 +482,14 @@ Consider this:
   f :: T Int -> blah
   f x vs = case x of { MkT y ->
              let f vs = ...(case y of I# w -> e)...f..
-             in f vs
+             in f vs }
 
 Here we can float the (case y ...) out, because y is sure
 to be evaluated, to give
   f x vs = case x of { MkT y ->
-           case y of I# w ->
+           case y of { I# w ->
              let f vs = ...(e)...f..
-             in f vs
+             in f vs }}
 
 That saves unboxing it every time round the loop.  It's important in
 some DPH stuff where we really want to avoid that repeated unboxing in
@@ -614,7 +614,8 @@ lvlMFE env strict_ctxt e@(_, AnnCase {})
   = lvlExpr env e     -- See Note [Case MFEs]
 
 lvlMFE env strict_ctxt ann_expr
-  | not float_me
+  |  notWorthFloating expr abs_vars
+  || not float_me
   || floatTopLvlOnly env && not (isTopLvl dest_lvl)
          -- Only floating to the top level is allowed.
   || hasFreeJoin env fvs   -- If there is a free join, don't float
@@ -623,9 +624,6 @@ lvlMFE env strict_ctxt ann_expr
          -- We can't let-bind an expression if we don't know
          -- how it will be represented at runtime.
          -- See Note [Representation polymorphism invariants] in GHC.Core
-  || notWorthFloating expr abs_vars
-         -- Test notWorhtFloating last;
-         -- See Note [Large free-variable sets]
   = -- Don't float it out
     lvlExpr env ann_expr
 
@@ -676,12 +674,11 @@ lvlMFE env strict_ctxt ann_expr
     is_function  = isFunction ann_expr
     mb_bot_str   = exprBotStrictness_maybe expr
                            -- See Note [Bottoming floats]
-                           -- esp Bottoming floats (2)
+                           -- esp Bottoming floats (BF2)
     expr_ok_for_spec = exprOkForSpeculation expr
     abs_vars = abstractVars dest_lvl env fvs
     dest_lvl = destLevel env fvs fvs_ty is_function is_bot_lam
-               -- NB: is_bot_lam not is_bot; see (3) in
-               --     Note [Bottoming floats]
+               -- NB: is_bot_lam not is_bot; see (BF2) in Note [Bottoming floats]
 
     -- float_is_new_lam: the floated thing will be a new value lambda
     -- replacing, say (g (x+4)) by (lvl x).  No work is saved, nor is
@@ -698,20 +695,32 @@ lvlMFE env strict_ctxt ann_expr
 
         -- A decision to float entails let-binding this thing, and we only do
         -- that if we'll escape a value lambda, or will go to the top level.
+        -- Never float trivial expressions;
+        --   notably, save_work might be true of a lone evaluated variable.
     float_me = saves_work || saves_alloc || is_mk_static
 
     -- See Note [Saving work]
-    saves_work = escapes_value_lam        -- (a)
-                 && not (exprIsHNF expr)  -- (b)
-                 && not float_is_new_lam  -- (c)
+    is_hnf = exprIsHNF expr
+    saves_work = escapes_value_lam        -- (SW-a)
+                 && not is_hnf            -- (SW-b)
+                 && not float_is_new_lam  -- (SW-c)
     escapes_value_lam = dest_lvl `ltMajLvl` (le_ctxt_lvl env)
 
-    -- See Note [Saving allocation] and Note [Floating to the top]
-    saves_alloc =  isTopLvl dest_lvl
-                && floatConsts env
-                && (   not strict_ctxt                     -- (a)
-                    || exprIsHNF expr                      -- (b)
-                    || (is_bot_lam && escapes_value_lam))  -- (c)
+    -- See Note [Floating to the top]
+--    is_con_app = isSaturatedConApp expr  -- True of literal strings too
+    saves_alloc = isTopLvl dest_lvl
+               && (escapes_value_lam || floatConsts env)
+                  -- Always float allocation out of a value lambda
+                  --    if it gets to top level
+               && (not strict_ctxt || is_hnf || is_bot_lam)
+        -- is_con_app: don't float PAPs to the top; they may well end
+        -- up getting eta-expanded and re-inlined
+        -- E.g.   f = \x -> (++) ys
+        -- If we float, then eta-expand we get
+        --      lvl = (++) ys
+        --      f = \x \zs -> lvl zs
+        -- and now we'll inline lvl.  Silly.
+
 
 hasFreeJoin :: LevelEnv -> DVarSet -> Bool
 -- Has a free join point which is not being floated to top level.
@@ -726,22 +735,22 @@ hasFreeJoin env fvs
 The key idea in let-floating is to
   * float a redex out of a (value) lambda
 Doing so can save an unbounded amount of work.
-But see also Note [Saving allocation].
+But see also Note [Floating to the top].
 
 So we definitely float an expression out if
-(a) It will escape a value lambda (escapes_value_lam)
-(b) The expression is not a head-normal form (exprIsHNF); see (SW1, SW2).
-(c) Floating does not require wrapping it in value lambdas (float_is_new_lam).
+(SW-a) It will escape a value lambda (escapes_value_lam)
+(SW-b) The expression is not a head-normal form (exprIsHNF); see (SW1, SW2).
+(SW-c) Floating does not require wrapping it in value lambdas (float_is_new_lam).
     See (SW3) below
 
 Wrinkles:
 
-(SW1) Concerning (b) I experimented with using `exprIsCheap` rather than
+(SW1) Concerning (SW-b) I experimented with using `exprIsCheap` rather than
       `exprIsHNF` but the latter seems better, according to nofib
       (`spectral/mate` got 10% worse with exprIsCheap).  It's really a bit of a
       heuristic.
 
-(SW2) What about omitting (b), and hence floating HNFs as well?  The danger of
+(SW2) What about omitting (SW-b), and hence floating HNFs as well?  The danger of
       doing so is that we end up floating out a HNF from a cold path (where it
       might never get allocated at all) and allocating it all the time
       regardless.  Example
@@ -760,7 +769,7 @@ Wrinkles:
        - Occasionally decreases runtime allocation (T12996 -2.5%)
        - Slightly mixed effect on nofib: (puzzle -10%, mate -5%, cichelli +5%)
          but geometric mean is -0.09%.
-      Overall, a win.
+      Overall, a small win.
 
 (SW3) Concerning (c), if we are wrapping the thing in extra value lambdas (in
       abs_vars), then nothing is saved.  E.g.
@@ -771,10 +780,12 @@ Wrinkles:
       we have saved nothing: one pair will still be allocated for each
       call of `f`.  Hence the (not float_is_new_lam) in saves_work.
 
-Note [Saving allocation]
-~~~~~~~~~~~~~~~~~~~~~~~~
+Note [Floating to the top]
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 Even if `saves_work` is false, we we may want to float even cheap/HNF
-expressions out of value lambdas, for several reasons:
+expressions out of value lambdas. Data suggests, however, that it is better
+/only/ to do so, /if/ they can go to top level. If the expression goes to top
+level we don't pay the cost of allocating cold-path thunks described in (SW2).
 
 * Doing so may save allocation. Consider
         f = \x.  .. (\y.e) ...
@@ -782,6 +793,11 @@ expressions out of value lambdas, for several reasons:
   (assuming e does not mention x). An example where this really makes a
   difference is simplrun009.
 
+* In principle this would be true even if the (\y.e) didn't go to top level; but
+  in practice we only float a HNF if it goes all way to the top.  We don't pay
+  /any/ allocation cost for a top-level floated expression; it just becomes
+  static data.
+
 * It may allow SpecContr to fire on functions. Consider
         f = \x. ....(f (\y.e))....
   After floating we get
@@ -793,21 +809,7 @@ expressions out of value lambdas, for several reasons:
   a big difference for string literals and bottoming expressions: see Note
   [Floating to the top]
 
-Data suggests, however, that it is better /only/ to float HNFS, /if/ they can go
-to top level. See (SW2) of Note [Saving work].  If the expression goes to top
-level we don't pay the cost of allocating cold-path thunks described in (SW2).
-
-Hence `isTopLvl dest_lvl` in `saves_alloc`.
-
-Note [Floating to the top]
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-Even though Note [Saving allocation] suggests that we should not, in
-general, float HNFs, the balance change if it goes to the top:
-
-* We don't pay an allocation cost for the floated expression; it
-  just becomes static data.
-
-* Floating string literal is valuable -- no point in duplicating the
+* Floating string literals is valuable -- no point in duplicating the
   at each call site!
 
 * Floating bottoming expressions is valuable: they are always cold
@@ -815,32 +817,32 @@ general, float HNFs, the balance change if it goes to the top:
   can be quite big, inhibiting inlining. See Note [Bottoming floats]
 
 So we float an expression to the top if:
-  (a) the context is lazy (so we get allocation), or
-  (b) the expression is a HNF (so we get allocation), or
-  (c) the expression is bottoming and floating would escape a
-      value lambda (NB: if the expression itself is a lambda, (b)
-      will apply; so this case only catches bottoming thunks)
+  (FT1) the context is lazy (so we get allocation), or
+  (FT2) the expression is a HNF (so we get allocation), or
+  (FT3) the expression is bottoming and floating would escape a
+         value lambda (NB: if the expression itself is a lambda, (b)
+         will apply; so this case only catches bottoming thunks)
 
 Examples:
 
-* (a) Strict.  Case scrutinee
+* (FT1) Strict.  Case scrutinee
       f = case g True of ....
   Don't float (g True) to top level; then we have the admin of a
   top-level thunk to worry about, with zero gain.
 
-* (a) Strict.  Case alternative
+* (FT1) Strict.  Case alternative
       h = case y of
              True  -> g True
              False -> False
   Don't float (g True) to the top level
 
-* (b) HNF
+* (FT2) HNF
       f = case y of
             True  -> p:q
             False -> blah
   We may as well float the (p:q) so it becomes a static data structure.
 
-* (c) Bottoming expressions; see also Note [Bottoming floats]
+* (FT3) Bottoming expressions; see also Note [Bottoming floats]
       f x = case x of
               0 -> error <big thing>
               _ -> x+1
@@ -853,7 +855,7 @@ Examples:
   'foo' anyway.  So float bottoming things only if they escape
   a lambda.
 
-* Arguments
+* (FT4) Arguments
      t = f (g True)
   Prior to Apr 22 we didn't float (g True) to the top if f was strict.
   But (a) this only affected CAFs, because if it escapes a value lambda
@@ -868,28 +870,6 @@ early loses opportunities for RULES which (needless to say) are
 important in some nofib programs (gcd is an example).  [SPJ note:
 I think this is obsolete; the flag seems always on.]
 
-Note [Large free-variable sets]
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-In #24471 we had something like
-     x1 = I# 1
-     ...
-     x1000 = I# 1000
-     foo = f x1 (f x2 (f x3 ....))
-So every sub-expression in `foo` has lots and lots of free variables.  But
-none of these sub-expressions float anywhere; the entire float-out pass is a
-no-op.
-
-In lvlMFE, we want to find out quickly if the MFE is not-floatable; that is
-the common case.  In #24471 it turned out that we were testing `abs_vars` (a
-relatively complicated calculation that takes at least O(n-free-vars) time to
-compute) for every sub-expression.
-
-Better instead to test `float_me` early. That still involves looking at
-dest_lvl, which means looking at every free variable, but the constant factor
-is a lot better.
-
-ToDo: find a way to fix the bad asymptotic complexity.
-
 Note [Floating join point bindings]
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Mostly we don't float join points at all -- we want them to /stay/ join points.
@@ -1053,30 +1033,36 @@ we'd like to float the call to error, to get
 
 But, as ever, we need to be careful:
 
-(1) We want to float a bottoming
+(BF1) We want to float a bottoming
     expression even if it has free variables:
         f = \x. g (let v = h x in error ("urk" ++ v))
     Then we'd like to abstract over 'x', and float the whole arg of g:
         lvl = \x. let v = h x in error ("urk" ++ v)
         f = \x. g (lvl x)
-    To achieve this we pass is_bot to destLevel
-
-(2) We do not do this for lambdas that return
-    bottom.  Instead we treat the /body/ of such a function specially,
-    via point (1).  For example:
+    To achieve this we pass `is_bot` to destLevel
+
+(BF2) We do the same for /lambdas/ that return bottom.
+    Suppose the original lambda had /no/ free vars:
+        f = \x. ....(\y z. error (y++z))...
+    then we'd like to float that whole lambda
+        lvl = \y z. error (y++z)
+        f = \x. ....lvl....
+    If we just floated its bottom-valued body, we might abstract the arguments in
+    the "wrong" order and end up with this bad result
+        lvl = \z y. error (y++z)
+        f = \x. ....(\y z. lvl z y)....
+
+    If the lambda does have free vars, this will happen:
         f = \x. ....(\y z. if x then error y else error z)....
-    If we float the whole lambda thus
+    We float the whole lambda thus
         lvl = \x. \y z. if x then error y else error z
         f = \x. ...(lvl x)...
-    we may well end up eta-expanding that PAP to
+    And we may well end up eta-expanding that PAP to
+        lvl = \x. \y z. if b then error y else error z
         f = \x. ...(\y z. lvl x y z)...
+    so we get a (small) closure.  So be it.
 
-    ===>
-        lvl = \x z y. if b then error y else error z
-        f = \x. ...(\y z. lvl x z y)...
-    (There is no guarantee that we'll choose the perfect argument order.)
-
-(3) If we have a /binding/ that returns bottom, we want to float it to top
+(BF3) If we have a /binding/ that returns bottom, we want to float it to top
     level, even if it has free vars (point (1)), and even it has lambdas.
     Example:
        ... let { v = \y. error (show x ++ show y) } in ...
@@ -1092,7 +1078,6 @@ But, as ever, we need to be careful:
     join points (#24768), and floating to the top would abstract over those join
     points, which we should never do.
 
-
 See Maessen's paper 1999 "Bottom extraction: factoring error handling out
 of functional programs" (unpublished I think).
 
@@ -1135,7 +1120,6 @@ float the case (as advocated here) we won't float the (build ...y..)
 either, so fusion will happen.  It can be a big effect, esp in some
 artificial benchmarks (e.g. integer, queens), but there is no perfect
 answer.
-
 -}
 
 annotateBotStr :: Id -> Arity -> Maybe (Arity, DmdSig, CprSig) -> Id
@@ -1152,69 +1136,124 @@ annotateBotStr id n_extra mb_bot_str
   = id
 
 notWorthFloating :: CoreExpr -> [Var] -> Bool

--- Returns True if the expression would be replaced by
--- something bigger than it is now.  For example:
---   abs_vars = tvars only:  return True if e is trivial,
---                           but False for anything bigger
---   abs_vars = [x] (an Id): return True for trivial, or an application (f x)
---                           but False for (f x x)
---
--- One big goal is that floating should be idempotent.  Eg if
--- we replace e with (lvl79 x y) and then run FloatOut again, don't want
--- to replace (lvl79 x y) with (lvl83 x y)!
-
+--  See Note [notWorthFloating]
 notWorthFloating e abs_vars
-  = go e (count isId abs_vars)
+  = go e 0
   where
-    go (Var {}) n               = n >= 0
-    go (Lit lit) n              = assert (n==0) $
-                                  litIsTrivial lit   -- Note [Floating literals]
-    go (Type {}) _              = True
-    go (Coercion {}) _          = True
+    n_abs_vars = count isId abs_vars  -- See (NWF5)
+
+    go :: CoreExpr -> Int -> Bool
+    -- (go e n) return True if (e x1 .. xn) is not worth floating
+    -- where `e` has n trivial value arguments x1..xn
+    -- See (NWF4)
+    go (Lit lit) n         = assert (n==0) $
+                             litIsTrivial lit   -- See (NWF1)
+    go (Type {}) _         = True
+    go (Tick t e) n        = not (tickishIsCode t) && go e n
+    go (Cast e _) n        = n==0 || go e n     -- See (NWF3)
+    go (Coercion {}) _     = True
     go (App e arg) n
-       -- See Note [Floating applications to coercions]
-       | not (isRuntimeArg arg) = go e n
-       | n==0                   = False
-       | exprIsTrivial arg      = go e (n-1) -- NB: exprIsTrivial arg = go arg 0
-       | otherwise              = False
-    go (Tick t e) n             = not (tickishIsCode t) && go e n
-    go (Cast e _)  n            = go e n
-    go (Case e b _ as) n
+       | Type {} <- arg    = go e n    -- Just types, not coercions (NWF2)
+       | exprIsTrivial arg = go e (n+1)
+       | otherwise         = False  -- (f non-triv) is worth floating
+
+    go (Case e b _ as) _
+      -- Do not float the `case` part of trivial cases (NWF3)
+      -- We'll have a look at the RHS when we get there
       | null as
-      = go e n     -- See Note [Empty case is trivial]
-      | Just rhs <- isUnsafeEqualityCase e b as
-      = go rhs n   -- See (U2) of Note [Implementing unsafeCoerce] in base:Unsafe.Coerce
-    go _ _                      = False
+      = True   -- See Note [Empty case is trivial]
+      | Just {} <- isUnsafeEqualityCase e b as
+      = True   -- See (U2) of Note [Implementing unsafeCoerce] in base:Unsafe.Coerce
+      | otherwise
+      = False
 
-{-
-Note [Floating literals]
-~~~~~~~~~~~~~~~~~~~~~~~~
-It's important to float Integer literals, so that they get shared,
-rather than being allocated every time round the loop.
-Hence the litIsTrivial.
+    go (Var _) n
+      | n==0             = True   -- Naked variable
+      | n <= n_abs_vars  = True   -- (f a b c) is not worth floating if
+      | otherwise        = False  -- a,b,c are all abstracted; see (NWF5)
 
-Ditto literal strings (LitString), which we'd like to float to top
-level, which is now possible.
+    go _ _ = False  -- Let etc is worth floating
 
-Note [Floating applications to coercions]
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-We don’t float out variables applied only to type arguments, since the
-extra binding would be pointless: type arguments are completely erased.
-But *coercion* arguments aren’t (see Note [Coercion tokens] in
-"GHC.CoreToStg" and Note [inlineBoringOk] in"GHC.Core.Unfold"),
-so we still want to float out variables applied only to
-coercion arguments.
+{- Note [notWorthFloating]
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+`notWorthFloating` returns True if the expression would be replaced by something
+bigger than it is now.  One big goal is that floating should be idempotent.  Eg
+if we replace e with (lvl79 x y) and then run FloatOut again, don't want to
+replace (lvl79 x y) with (lvl83 x y)!
 
+For example:
+  abs_vars = tvars only:  return True if e is trivial,
+                          but False for anything bigger
+  abs_vars = [x] (an Id): return True for trivial, or an application (f x)
+                          but False for (f x x)
+
+(NWF1) It's important to float Integer literals, so that they get shared, rather
+  than being allocated every time round the loop.  Hence the litIsTrivial.
+
+  Ditto literal strings (LitString), which we'd like to float to top
+  level, which is now possible.
+
+(NWF2) We don’t float out variables applied only to type arguments, since the
+  extra binding would be pointless: type arguments are completely erased.
+  But *coercion* arguments aren’t (see Note [Coercion tokens] in
+  "GHC.CoreToStg" and Note [inlineBoringOk] in"GHC.Core.Unfold"),
+  so we still want to float out variables applied only to
+  coercion arguments.
+
+(NWF3) Some expressions have trivial wrappers:
+     - Casts (e |> co)
+     - Unary-class applications:
+          - Dictionary applications (MkC meth)
+          - Class-op applictions    (op dict)
+     - Case of empty alts
+     - Unsafe-equality case
+  In all these cases we say "not worth floating", and we do so /regardless/
+  of the wrapped expression.  The SetLevels stuff may subsequently float the
+  components of the expression.
+
+  Example:  is it worth floating (f x |> co)?  No!  If we did we'd get
+     lvl = f x |> co
+     ...lvl....
+  Then we'd do cast worker/wrapper and end up with.
+     lvl' = f x
+     ...(lvl' |> co)...
+  Silly!  Better not to float it in the first place.  If we say "no" here,
+  we'll subsequently say "yes" for (f x) and get
+     lvl = f x
+     ....(lvl |> co)...
+  which is what we want.  In short: don't float trivial wrappers.
+
+(NWF4) The only non-trivial expression that we say "not worth floating" for
+  is an application
+             f x y z
+  where the number of value arguments is <= the number of abstracted Ids.
+  This is what makes floating idempotent.  Hence counting the number of
+  value arguments in `go`
+
+(NWF5) In #24471 we had something like
+     x1 = I# 1
+     ...
+     x1000 = I# 1000
+     foo = f x1 (f x2 (f x3 ....))
+  So every sub-expression in `foo` has lots and lots of free variables.  But
+  none of these sub-expressions float anywhere; the entire float-out pass is a
+  no-op.
 
-************************************************************************
-*                                                                      *
-\subsection{Bindings}
-*                                                                      *
-************************************************************************
+  So `notWorthFloating` tries to avoid evaluating `n_abs_vars`, in cases where
+  it obviously /is/ worth floating.  (In #24471 it turned out that we were
+  testing `abs_vars` (a relatively complicated calculation that takes at least
+  O(n-free-vars) time to compute) for every sub-expression.)
 
-The binding stuff works for top level too.
+  Hence testing `n_abs_vars only` at the very end.
 -}
 
+{- *********************************************************************
+*                                                                      *
+                       Bindings
+        This binding stuff works for top level too.
+*                                                                      *
+********************************************************************* -}
+
 lvlBind :: LevelEnv
         -> CoreBindWithFVs
         -> LvlM (LevelledBind, LevelEnv)
@@ -1261,7 +1300,7 @@ lvlBind env (AnnNonRec bndr rhs)
         -- is_bot_lam: looks like (\xy. bot), maybe zero lams
         -- NB: not isBottomThunk!
         -- NB: not is_join: don't send bottoming join points to the top.
-        -- See Note [Bottoming floats] point (3)
+        -- See Note [Bottoming floats] (BF3)
 
     is_top_bindable = exprIsTopLevelBindable deann_rhs bndr_ty
     n_extra       = count isId abs_vars
@@ -1552,9 +1591,8 @@ destLevel env fvs fvs_ty is_function is_bot
                               -- See Note [Floating join point bindings]
   = tOP_LEVEL
 
-  | is_bot              -- Send bottoming bindings to the top
-  = as_far_as_poss      -- regardless; see Note [Bottoming floats]
-                        -- Esp Bottoming floats (1) and (3)
+  | is_bot              -- Send bottoming bindings to the top regardless;
+  = as_far_as_poss      -- see (BF1) and (BF2) in Note [Bottoming floats]
 
   | Just n_args <- floatLams env
   , n_args > 0  -- n=0 case handled uniformly by the 'otherwise' case
@@ -1568,8 +1606,13 @@ destLevel env fvs fvs_ty is_function is_bot
     max_fv_id_level = maxFvLevel isId env fvs -- Max over Ids only; the
                                               -- tyvars will be abstracted
 
+    -- as_far_as_poss: destination level depends only on the free Ids (more
+    -- precisely, free CoVars) of the /type/, not the free Ids of the /term/.
+    -- Why worry about the free CoVars?  See Note [Floating and kind casts]
+    --
+    -- There may be free Ids in the term, but then we'll just
+    -- lambda-abstract over them
     as_far_as_poss = maxFvLevel' isId env fvs_ty
-                     -- See Note [Floating and kind casts]
 
 {- Note [Floating and kind casts]
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1732,10 +1775,9 @@ maxFvLevel max_me env var_set
     -- It's OK to use a non-deterministic fold here because maxIn commutes.
 
 maxFvLevel' :: (Var -> Bool) -> LevelEnv -> TyCoVarSet -> Level
--- Same but for TyCoVarSet
+-- Precisely the same as `maxFvLevel` but for TyCoVarSet rather than DVarSet
 maxFvLevel' max_me env var_set
   = nonDetStrictFoldUniqSet (maxIn max_me env) tOP_LEVEL var_set
-    -- It's OK to use a non-deterministic fold here because maxIn commutes.
 
 maxIn :: (Var -> Bool) -> LevelEnv -> InVar -> Level -> Level
 maxIn max_me (LE { le_lvl_env = lvl_env, le_env = id_env }) in_var lvl


=====================================
compiler/GHC/Core/Opt/Simplify/Env.hs
=====================================
@@ -11,7 +11,7 @@ module GHC.Core.Opt.Simplify.Env (
         SimplMode(..), updMode, smPlatform,
 
         -- * Environments
-        SimplEnv(..), pprSimplEnv,   -- Temp not abstract
+        SimplEnv(..), StaticEnv, pprSimplEnv,
         seArityOpts, seCaseCase, seCaseFolding, seCaseMerge, seCastSwizzle,
         seDoEtaReduction, seEtaExpand, seFloatEnable, seInline, seNames,
         seOptCoercionOpts, sePhase, sePlatform, sePreInline,
@@ -170,6 +170,8 @@ coercion we don't apply optCoercion to it if seInlineDepth>0.
 Reason: it has already been optimised once, no point in doing so again.
 -}
 
+type StaticEnv = SimplEnv       -- Just the static part is relevant
+
 data SimplEnv
   = SimplEnv {
      ----------- Static part of the environment -----------
@@ -407,7 +409,6 @@ data SimplSR
        -- and  ja = Just a <=> x is a join-point of arity a
        -- See Note [Join arity in SimplIdSubst]
 
-
   | DoneId OutId
        -- If  x :-> DoneId v   is in the SimplIdSubst
        -- then replace occurrences of x by v
@@ -778,7 +779,7 @@ emptyJoinFloats = nilOL
 isEmptyJoinFloats :: JoinFloats -> Bool
 isEmptyJoinFloats = isNilOL
 
-unitLetFloat :: OutBind -> LetFloats
+unitLetFloat :: HasDebugCallStack => OutBind -> LetFloats
 -- This key function constructs a singleton float with the right form
 unitLetFloat bind = assert (all (not . isJoinId) (bindersOf bind)) $
                     LetFloats (unitOL bind) (flag bind)


=====================================
compiler/GHC/Core/Opt/Simplify/Iteration.hs
=====================================
@@ -447,7 +447,7 @@ we want to do something very similar to worker/wrapper:
 
 We call this making a cast worker/wrapper in tryCastWorkerWrapper.
 
-The main motivaiton is that x can be inlined freely.  There's a chance
+The main motivation is that x can be inlined freely.  There's a chance
 that e will be a constructor application or function, or something
 like that, so moving the coercion to the usage site may well cancel
 the coercions and lead to further optimisation.  Example:
@@ -576,11 +576,13 @@ Note [Concrete types] in GHC.Tc.Utils.Concrete.
 -}
 
 tryCastWorkerWrapper :: SimplEnv -> BindContext
-                     -> InId -> OutId -> OutExpr
-                     -> SimplM (SimplFloats, SimplEnv)
+                     -> OutId -> OutExpr
+                     -> SimplM (Maybe (LetFloats, OutId, OutExpr))
 -- See Note [Cast worker/wrapper]
-tryCastWorkerWrapper env bind_cxt old_bndr bndr (Cast rhs co)
-  | BC_Let top_lvl is_rec <- bind_cxt  -- Not join points
+-- Given input x = rhs |> co, the result will be
+--    (x' = rhs, x, x' |> co))
+tryCastWorkerWrapper env bind_cxt bndr (Cast rhs co)
+  | BC_Let top_lvl _ <- bind_cxt  -- Not join points
   , not (isDFunId bndr) -- nor DFuns; cast w/w is no help, and we can't transform
                         --            a DFunUnfolding in mk_worker_unfolding
   , not (exprIsTrivial rhs)        -- Not x = y |> co; Wrinkle 1
@@ -588,38 +590,23 @@ tryCastWorkerWrapper env bind_cxt old_bndr bndr (Cast rhs co)
   , typeHasFixedRuntimeRep work_ty    -- Don't peel off a cast if doing so would
                                       -- lose the underlying runtime representation.
                                       -- See Note [Preserve RuntimeRep info in cast w/w]
-  , not (isOpaquePragma (idInlinePragma old_bndr)) -- Not for OPAQUE bindings
-                                                   -- See Note [OPAQUE pragma]
+  , not (isOpaquePragma (idInlinePragma bndr)) -- Not for OPAQUE bindings
+                                               -- See Note [OPAQUE pragma]
   = do  { uniq <- getUniqueM
         ; let work_name = mkSystemVarName uniq occ_fs
               work_id   = mkLocalIdWithInfo work_name ManyTy work_ty work_info
-              is_strict = isStrictId bndr
 
-        ; (rhs_floats, work_rhs) <- prepareBinding env top_lvl is_rec is_strict
-                                                   work_id (emptyFloats env) rhs
-
-        ; work_unf <- mk_worker_unfolding top_lvl work_id work_rhs
+        ; work_unf <- mk_worker_unfolding top_lvl work_id rhs
         ; let  work_id_w_unf = work_id `setIdUnfolding` work_unf
-               floats   = rhs_floats `addLetFloats`
-                          unitLetFloat (NonRec work_id_w_unf work_rhs)
-
-               triv_rhs = Cast (Var work_id_w_unf) co
-
-        ; if postInlineUnconditionally env bind_cxt old_bndr bndr triv_rhs
-             -- Almost always True, because the RHS is trivial
-             -- In that case we want to eliminate the binding fast
-             -- We conservatively use postInlineUnconditionally so that we
-             -- check all the right things
-          then do { tick (PostInlineUnconditionally bndr)
-                  ; return ( floats
-                           , extendIdSubst (setInScopeFromF env floats) old_bndr $
-                             DoneEx triv_rhs NotJoinPoint ) }
-
-          else do { wrap_unf <- mkLetUnfolding env top_lvl VanillaSrc bndr False triv_rhs
-                  ; let bndr' = bndr `setInlinePragma` mkCastWrapperInlinePrag (idInlinePragma bndr)
-                                `setIdUnfolding`  wrap_unf
-                        floats' = floats `extendFloats` NonRec bndr' triv_rhs
-                  ; return ( floats', setInScopeFromF env floats' ) } }
+               work_bind     = NonRec work_id_w_unf rhs
+               triv_rhs      = Cast (Var work_id_w_unf) co
+
+        ; wrap_unf <- mkLetUnfolding env top_lvl VanillaSrc bndr False triv_rhs
+        ; let wrap_prag = mkCastWrapperInlinePrag (inlinePragInfo info)
+              bndr' = bndr `setInlinePragma` wrap_prag
+                           `setIdUnfolding`  wrap_unf
+
+        ; return (Just (unitLetFloat work_bind, bndr', triv_rhs)) }
   where
     -- Force the occ_fs so that the old Id is not retained in the new Id.
     !occ_fs = getOccFS bndr
@@ -647,10 +634,10 @@ tryCastWorkerWrapper env bind_cxt old_bndr bndr (Cast rhs co)
              | isStableSource src -> return (unf { uf_tmpl = mkCast unf_rhs (mkSymCo co) })
            _ -> mkLetUnfolding env top_lvl VanillaSrc work_id False work_rhs
 
-tryCastWorkerWrapper env _ _ bndr rhs  -- All other bindings
+tryCastWorkerWrapper _ _ bndr rhs  -- All other bindings
   = do { traceSmpl "tcww:no" (vcat [ text "bndr:" <+> ppr bndr
                                    , text "rhs:" <+> ppr rhs ])
-        ; return (mkFloatBind env (NonRec bndr rhs)) }
+       ; return Nothing }
 
 mkCastWrapperInlinePrag :: InlinePragma -> InlinePragma
 -- See Note [Cast worker/wrapper]
@@ -810,39 +797,40 @@ makeTrivial :: HasDebugCallStack
 -- Binds the expression to a variable, if it's not trivial, returning the variable
 -- For the Demand argument, see Note [Keeping demand info in StrictArg Plan A]
 makeTrivial env top_lvl dmd occ_fs expr
-  | exprIsTrivial expr                          -- Already trivial
-  || not (bindingOk top_lvl expr expr_ty)       -- Cannot trivialise
-                                                --   See Note [Cannot trivialise]
+  | exprIsTrivial expr                  -- Already trivial
   = return (emptyLetFloats, expr)
 
-  | Cast expr' co <- expr
-  = do { (floats, triv_expr) <- makeTrivial env top_lvl dmd occ_fs expr'
-       ; return (floats, Cast triv_expr co) }
+  | not (bindingOk top_lvl expr expr_ty)  -- Cannot trivialise
+  = return (emptyLetFloats, expr)         -- See Note [Cannot trivialise]
 
-  | otherwise -- 'expr' is not of form (Cast e co)
+  | otherwise
   = do  { (floats, expr1) <- prepareRhs env top_lvl occ_fs expr
         ; uniq <- getUniqueM
         ; let name = mkSystemVarName uniq occ_fs
-              var  = mkLocalIdWithInfo name ManyTy expr_ty id_info
+              bndr = mkLocalIdWithInfo name ManyTy expr_ty id_info
+              bind_ctxt = BC_Let top_lvl NonRecursive
 
         -- Now something very like completeBind,
         -- but without the postInlineUnconditionally part
-        ; (arity_type, expr2) <- tryEtaExpandRhs env (BC_Let top_lvl NonRecursive) var expr1
+        ; (arity_type, expr2) <- tryEtaExpandRhs env bind_ctxt bndr expr1
           -- Technically we should extend the in-scope set in 'env' with
           -- the 'floats' from prepareRHS; but they are all fresh, so there is
           -- no danger of introducing name shadowing in eta expansion
 
-        ; unf <- mkLetUnfolding env top_lvl VanillaSrc var False expr2
-
-        ; let final_id = addLetBndrInfo var arity_type unf
-              bind     = NonRec final_id expr2
+        ; unf <- mkLetUnfolding env top_lvl VanillaSrc bndr False expr2
+        ; let bndr'    = addLetBndrInfo bndr arity_type unf
+              anf_bind = NonRec bndr' expr2
 
-        ; traceSmpl "makeTrivial" (vcat [text "final_id" <+> ppr final_id, text "rhs" <+> ppr expr2 ])
-        ; return ( floats `addLetFlts` unitLetFloat bind, Var final_id ) }
+        ; mb_cast_ww <- tryCastWorkerWrapper env bind_ctxt bndr' expr2
+        ; case mb_cast_ww of
+            Nothing -> return (floats `addLetFlts` unitLetFloat anf_bind, Var bndr')
+            Just (work_flts, _, triv_rhs)
+                    -> return (floats `addLetFlts` work_flts, triv_rhs) }
   where
     id_info = vanillaIdInfo `setDemandInfo` dmd
     expr_ty = exprType expr
 
+
 bindingOk :: TopLevelFlag -> CoreExpr -> Type -> Bool
 -- True iff we can have a binding of this expression at this level
 -- Precondition: the type is the type of the expression
@@ -936,26 +924,50 @@ completeBind bind_cxt (old_bndr, unf_se) (new_bndr, new_rhs, env)
                             eta_rhs (idType new_bndr) new_arity old_unf
 
       ; let new_bndr_w_info = addLetBndrInfo new_bndr new_arity new_unfolding
-        -- See Note [In-scope set as a substitution]
+               -- See Note [In-scope set as a substitution]
+            occ_anald_rhs = maybeUnfoldingTemplate new_unfolding `orElse` eta_rhs
+               -- occ_anald_rhs: see Note [Use occ-anald RHS in postInlineUnconditionally]
 
+      -- Try postInlineUnconditionally for (x = rhs)
+      -- If that succeeds we don't want to do tryCastWorkerWrapper
       ; if postInlineUnconditionally env bind_cxt old_bndr new_bndr_w_info eta_rhs
-
-        then -- Inline and discard the binding
-             do  { tick (PostInlineUnconditionally old_bndr)
-                 ; let unf_rhs = maybeUnfoldingTemplate new_unfolding `orElse` eta_rhs
-                          -- See Note [Use occ-anald RHS in postInlineUnconditionally]
-                 ; simplTrace "PostInlineUnconditionally" (ppr new_bndr <+> ppr unf_rhs) $
-                   return ( emptyFloats env
-                          , extendIdSubst env old_bndr $
-                            DoneEx unf_rhs (idJoinPointHood new_bndr)) }
+        then post_inline_it (emptyFloats env) occ_anald_rhs
+        else
+
+   do { -- Try cast worker-wrapper
+        mb_cast_ww <- tryCastWorkerWrapper env bind_cxt new_bndr_w_info eta_rhs
+      ; case mb_cast_ww of
+          Nothing -> no_post_inline (emptyFloats env) new_bndr_w_info eta_rhs
+
+          Just (cast_let_flts, new_bndr, new_rhs)
+            -- Try postInlineUnconditionally for (new_bndr = new_rhs)
+            -- Almost always fires, because `new_rhs` is small, but we conservatively
+            -- use `postInlineUnconditionally` so that we check all the right things
+            | postInlineUnconditionally env bind_cxt old_bndr new_bndr new_rhs
+            -> post_inline_it cast_floats new_rhs
+                   -- new_rhs is (x |> co) so no need to occ-anal
+            | otherwise
+            -> no_post_inline cast_floats new_bndr new_rhs
+            where
+              cast_floats = emptyFloats env `addLetFloats` cast_let_flts
+    } }
+  where
+    no_post_inline floats new_bndr new_rhs
+      = do { let the_bind = NonRec new_bndr new_rhs
+                 floats'  = floats `extendFloats` the_bind
+                 env'     = env `setInScopeFromF` floats'
+           ; return (floats', env') }
+
+    post_inline_it floats rhs
+      = do  { simplTrace "PostInlineUnconditionally" (ppr old_bndr <+> ppr rhs) $
+              tick (PostInlineUnconditionally old_bndr)
+            ; let env' = env `setInScopeFromF` floats
+            ; return ( floats
+                     , extendIdSubst env' old_bndr $
+                       DoneEx rhs (idJoinPointHood old_bndr)) }
                 -- Use the substitution to make quite, quite sure that the
                 -- substitution will happen, since we are going to discard the binding
 
-        else -- Keep the binding; do cast worker/wrapper
---             simplTrace "completeBind" (vcat [ text "bndrs" <+> ppr old_bndr <+> ppr new_bndr
---                                             , text "eta_rhs" <+> ppr eta_rhs ]) $
-             tryCastWorkerWrapper env bind_cxt old_bndr new_bndr_w_info eta_rhs }
-
 addLetBndrInfo :: OutId -> ArityType -> Unfolding -> OutId
 addLetBndrInfo new_bndr new_arity_type new_unf
   = new_bndr `setIdInfo` info5
@@ -3955,7 +3967,17 @@ mkDupableContWithDmds env dmds
         ; (floats1, cont') <- mkDupableContWithDmds env cont_dmds cont
         ; let env' = env `setInScopeFromF` floats1
         ; (_, se', arg') <- simplLazyArg env' dup hole_ty Nothing se arg
-        ; (let_floats2, arg'') <- makeTrivial env NotTopLevel dmd (fsLit "karg") arg'
+
+        -- Make the argument duplicable. Danger: if arg is small and we let-bind
+        -- it, then postInlineUnconditionally will just inline it again, perhaps
+        -- taking an extra Simplifier iteration (e.g. in test T21839c). So make
+        -- a `let` only if `couldBeSmallEnoughToInline` says that it is big enough
+        ; let uf_opts = seUnfoldingOpts env
+        ; (let_floats2, arg'')
+              <- if couldBeSmallEnoughToInline uf_opts (unfoldingUseThreshold uf_opts) arg'
+                 then return (emptyLetFloats, arg')
+                 else makeTrivial env NotTopLevel dmd (fsLit "karg") arg'
+
         ; let all_floats = floats1 `addLetFloats` let_floats2
         ; return ( all_floats
                  , ApplyToVal { sc_arg = arg''
@@ -4592,7 +4614,8 @@ mkLetUnfolding :: SimplEnv -> TopLevelFlag -> UnfoldingSource
                -> InId -> Bool    -- True <=> this is a join point
                -> OutExpr -> SimplM Unfolding
 mkLetUnfolding env top_lvl src id is_join new_rhs
-  = return (mkUnfolding uf_opts src is_top_lvl is_bottoming is_join new_rhs Nothing)
+  = -- Monadic to force those where-bindings
+    return (mkUnfolding uf_opts src is_top_lvl is_bottoming is_join new_rhs Nothing)
             -- We make an  unfolding *even for loop-breakers*.
             -- Reason: (a) It might be useful to know that they are WHNF
             --         (b) In GHC.Iface.Tidy we currently assume that, if we want to


=====================================
compiler/GHC/Core/Opt/Simplify/Utils.hs
=====================================
@@ -216,8 +216,6 @@ data SimplCont
         CoreTickish     -- Tick tickish <hole>
         SimplCont
 
-type StaticEnv = SimplEnv       -- Just the static part is relevant
-
 data FromWhat = FromLet | FromBeta Levity
 
 -- See Note [DupFlag invariants]
@@ -723,7 +721,6 @@ which it is on the LHS of a rule (see updModeForRules), then don't
 make use of the strictness info for the function.
 -}
 
-
 {-
 ************************************************************************
 *                                                                      *
@@ -1423,8 +1420,12 @@ preInlineUnconditionally for
 Note [Top-level bottoming Ids]
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Don't inline top-level Ids that are bottoming, even if they are used just
-once, because FloatOut has gone to some trouble to extract them out.
-Inlining them won't make the program run faster!
+once, because FloatOut has gone to some trouble to extract them out. e.g.
+    report x y = error (..lots of stuff...)
+    f x y z = if z then report x y else ...blah...
+Here `f` might be small enough to inline; but if we put all the `report`
+stuff inside it, it'll look to big.  In general we don't want to duplicate
+all the error-reporting goop.
 
 Note [Do not inline CoVars unconditionally]
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1460,51 +1461,25 @@ preInlineUnconditionally env top_lvl bndr rhs rhs_env
     extend_subst_with inl_rhs = extendIdSubst env bndr $! (mkContEx rhs_env inl_rhs)
 
     one_occ IAmDead = True -- Happens in ((\x.1) v)
-    one_occ OneOcc{ occ_n_br   = 1
-                  , occ_in_lam = NotInsideLam }   = isNotTopLevel top_lvl || early_phase
+    one_occ OneOcc{ occ_n_br    = 1
+                  , occ_in_lam  = NotInsideLam
+                  , occ_int_cxt = int_cxt }
+      =  isNotTopLevel top_lvl      -- Get rid of allocation
+      || (int_cxt==IsInteresting)   -- Function is applied
+   --   || (early_phase && not (isConLikeUnfolding unf))  -- See early_phase
     one_occ OneOcc{ occ_n_br   = 1
                   , occ_in_lam = IsInsideLam
-                  , occ_int_cxt = IsInteresting } = canInlineInLam rhs
-    one_occ _                                     = False
+                  , occ_int_cxt = IsInteresting }
+      = canInlineInLam rhs
+    one_occ _
+      = False
 
     pre_inline_unconditionally = sePreInline env
     active = isActive (sePhase env) (inlinePragmaActivation inline_prag)
              -- See Note [pre/postInlineUnconditionally in gentle mode]
     inline_prag = idInlinePragma bndr
 
--- Be very careful before inlining inside a lambda, because (a) we must not
--- invalidate occurrence information, and (b) we want to avoid pushing a
--- single allocation (here) into multiple allocations (inside lambda).
--- Inlining a *function* with a single *saturated* call would be ok, mind you.
---      || (if is_cheap && not (canInlineInLam rhs) then pprTrace "preinline" (ppr bndr <+> ppr rhs) ok else ok)
---      where
---              is_cheap = exprIsCheap rhs
---              ok = is_cheap && int_cxt
-
-        --      int_cxt         The context isn't totally boring
-        -- E.g. let f = \ab.BIG in \y. map f xs
-        --      Don't want to substitute for f, because then we allocate
-        --      its closure every time the \y is called
-        -- But: let f = \ab.BIG in \y. map (f y) xs
-        --      Now we do want to substitute for f, even though it's not
-        --      saturated, because we're going to allocate a closure for
-        --      (f y) every time round the loop anyhow.
-
-        -- canInlineInLam => free vars of rhs are (Once in_lam) or Many,
-        -- so substituting rhs inside a lambda doesn't change the occ info.
-        -- Sadly, not quite the same as exprIsHNF.
-    canInlineInLam (Lit _)    = True
-    canInlineInLam (Lam b e)  = isRuntimeVar b || canInlineInLam e
-    canInlineInLam (Tick t e) = not (tickishIsCode t) && canInlineInLam e
-    canInlineInLam (Var v)    = case idOccInfo v of
-                                  OneOcc { occ_in_lam = IsInsideLam } -> True
-                                  ManyOccs {}                         -> True
-                                  _                                   -> False
-    canInlineInLam _          = False
-      -- not ticks.  Counting ticks cannot be duplicated, and non-counting
-      -- ticks around a Lam will disappear anyway.
-
-    early_phase = sePhase env /= FinalPhase
+--    early_phase = sePhase env /= FinalPhase
     -- If we don't have this early_phase test, consider
     --      x = length [1,2,3]
     -- The full laziness pass carefully floats all the cons cells to
@@ -1532,6 +1507,52 @@ preInlineUnconditionally env top_lvl bndr rhs rhs_env
     -- (Nor can we check for `exprIsExpandable rhs`, because that needs to look
     -- at the non-existent unfolding for the `I# 2#` which is also floated out.)
 
+
+-- Be very careful before inlining inside a lambda, because (a) we must not
+-- invalidate occurrence information, and (b) we want to avoid pushing a
+-- single allocation (here) into multiple allocations (inside lambda).
+-- Inlining a *function* with a single *saturated* call would be ok, mind you.
+--      || (if is_cheap && not (canInlineInLam rhs) then pprTrace "preinline" (ppr bndr <+> ppr rhs) ok else ok)
+--      where
+--              is_cheap = exprIsCheap rhs
+--              ok = is_cheap && int_cxt
+
+        --      int_cxt         The context isn't totally boring
+        -- E.g. let f = \ab.BIG in \y. map f xs
+        --      Don't want to substitute for f, because then we allocate
+        --      its closure every time the \y is called
+        -- But: let f = \ab.BIG in \y. map (f y) xs
+        --      Now we do want to substitute for f, even though it's not
+        --      saturated, because we're going to allocate a closure for
+        --      (f y) every time round the loop anyhow.
+
+        -- canInlineInLam => free vars of rhs are (Once in_lam) or Many,
+        -- so substituting rhs inside a lambda doesn't change the occ info.
+        -- Sadly, not quite the same as exprIsHNF.
+canInlineInLam ::CoreExpr -> Bool
+canInlineInLam e
+  = go e
+  where
+    go (Lit _)    = True
+    go (Lam b e)  = isRuntimeVar b || go e
+    go (Cast e _) = go e
+    go (Tick t e) = not (tickishIsCode t) && go e
+    -- This matters only for:
+    --     x = y            -- or y|>co
+    --     f = \p. ..x..    -- One occurrence of x
+    --     ..y..            -- Multiple other occurrences of y
+    -- Then it is safe to inline x unconditionally
+    -- For postInlineUncondionally we have already tested exprIsTrivial
+    -- so this Var case never arises
+    go (Var v)    = case idOccInfo v of
+                       OneOcc { occ_in_lam = IsInsideLam } -> True
+                       ManyOccs {}                         -> True
+                       _                                   -> False
+    go _          = False
+      -- not ticks.  Counting ticks cannot be duplicated, and non-counting
+      -- ticks around a Lam will disappear anyway.
+
+
 {-
 ************************************************************************
 *                                                                      *
@@ -1582,71 +1603,77 @@ postInlineUnconditionally
 -- Reason: we don't want to inline single uses, or discard dead bindings,
 --         for unlifted, side-effect-ful bindings
 postInlineUnconditionally env bind_cxt old_bndr bndr rhs
-  | not active                  = False
-  | isWeakLoopBreaker occ_info  = False -- If it's a loop-breaker of any kind, don't inline
-                                        -- because it might be referred to "earlier"
-  | isStableUnfolding unfolding = False -- Note [Stable unfoldings and postInlineUnconditionally]
-  | isTopLevel (bindContextLevel bind_cxt)
-                                = False -- Note [Top level and postInlineUnconditionally]
-  | exprIsTrivial rhs           = True
-  | BC_Join {} <- bind_cxt      = False -- See point (1) of Note [Duplicating join points]
+  | not active                   = False
+  | isWeakLoopBreaker occ_info   = False -- If it's a loop-breaker of any kind, don't inline
+                                         -- because it might be referred to "earlier"
+  | isStableUnfolding unfolding  = False -- Note [Stable unfoldings and postInlineUnconditionally]
+  | BC_Join {} <- bind_cxt       = exprIsTrivial rhs
+                                        -- See point (DJ1) of Note [Duplicating join points]
                                         --     in GHC.Core.Opt.Simplify.Iteration
+  | is_top_lvl, isDeadEndId bndr = False -- Note [Top-level bottoming Ids]
   | otherwise
   = case occ_info of
-      OneOcc { occ_in_lam = in_lam, occ_int_cxt = int_cxt, occ_n_br = n_br }
-        -- See Note [Inline small things to avoid creating a thunk]
+      IAmALoopBreaker {} -> False
+      ManyOccs {} | is_top_lvl -> False  -- Note [Top level and postInlineUnconditionally]
+                  | otherwise  -> exprIsTrivial rhs
 
-        | n_br >= 100 -> False  -- See #23627
+      OneOcc { occ_in_lam = in_lam, occ_int_cxt = int_cxt, occ_n_br = n_br }
+              | exprIsTrivial rhs -> True
+              | otherwise         -> check_one_occ in_lam int_cxt n_br
 
-        | n_br == 1, NotInsideLam <- in_lam  -- One syntactic occurrence
-        -> True                              -- See Note [Post-inline for single-use things]
+      IAmDead -> True   -- This happens; for example, the case_bndr during case of
+                        -- known constructor:  case (a,b) of x { (p,q) -> ... }
+                        -- Here x isn't mentioned in the RHS, so we don't want to
+                        -- create the (dead) let-binding  let x = (a,b) in ...
+  where
+    is_top_lvl  = isTopLevel (bindContextLevel bind_cxt)
+    is_demanded = isStrUsedDmd (idDemandInfo bndr)
+    occ_info    = idOccInfo old_bndr
+    unfolding   = idUnfolding bndr
+    arity       = idArity bndr
+--    is_cheap    = isCheapUnfolding unfolding
+    uf_opts     = seUnfoldingOpts env
+    phase       = sePhase env
+    active      = isActive phase (idInlineActivation bndr)
+        -- See Note [pre/postInlineUnconditionally in gentle mode]
 
+    -- Check for code-size blow-up from inlining in multiple places
+    code_dup_ok n_br
+      | n_br == 1   = True  -- No duplication
+      | n_br >= 100 = False -- See #23627
+      | is_demanded = False -- Demanded => no allocation (it'll be a case expression
+                            -- in the end) so inlining duplicates code but nothing more
+      | otherwise   = smallEnoughToInline uf_opts unfolding
+
+    -- See Note [Post-inline for single-use things]
+    check_one_occ NotInsideLam NotInteresting n_br = not is_top_lvl && code_dup_ok n_br
+    check_one_occ NotInsideLam IsInteresting  n_br = code_dup_ok n_br
+    check_one_occ IsInsideLam  NotInteresting _    = False
+    check_one_occ IsInsideLam  IsInteresting  n_br = arity > 0 && code_dup_ok n_br
+      -- IsInteresting: inlining inside a lambda only with good reason
+      --    See the notes on int_cxt in preInlineUnconditionally
+      -- arity>0: do not inline data strutures under lambdas, only functions
+
+---------------
+-- A wrong bit of code, left here in case you are tempted to do this
 --        | is_unlifted                        -- Unlifted binding, hence ok-for-spec
 --        -> True                              -- hence cheap to inline probably just a primop
---                                             -- Not a big deal either way
 -- No, this is wrong.  {v = p +# q; x = K v}.
 -- Don't inline v; it'll just get floated out again. Stupid.
+---------------
 
-        | is_demanded
-        -> False                            -- No allocation (it'll be a case expression in the end)
-                                            -- so inlining duplicates code but nothing more
 
-        | otherwise
-        -> work_ok in_lam int_cxt && smallEnoughToInline uf_opts unfolding
-              -- Multiple syntactic occurences; but lazy, and small enough to dup
-              -- ToDo: consider discount on smallEnoughToInline if int_cxt is true
-
-      IAmDead -> True   -- This happens; for example, the case_bndr during case of
-                        -- known constructor:  case (a,b) of x { (p,q) -> ... }
-                        -- Here x isn't mentioned in the RHS, so we don't want to
-                        -- create the (dead) let-binding  let x = (a,b) in ...
-
-      _ -> False
-
-  where
-    work_ok NotInsideLam _              = True
-    work_ok IsInsideLam  IsInteresting  = isCheapUnfolding unfolding
-    work_ok IsInsideLam  NotInteresting = False
-      -- NotInsideLam: outside a lambda, we want to be reasonably aggressive
-      -- about inlining into multiple branches of case
+      -- NotInsideLam: outside a lambda, when not at top-level we want to be
+      -- reasonably aggressive about inlining into multiple branches of case
       -- e.g. let x = <non-value>
       --      in case y of { C1 -> ..x..; C2 -> ..x..; C3 -> ... }
       -- Inlining can be a big win if C3 is the hot-spot, even if
       -- the uses in C1, C2 are not 'interesting'
       -- An example that gets worse if you add int_cxt here is 'clausify'
 
-      -- InsideLam: check for acceptable work duplication, using isCheapUnfoldign
-      -- int_cxt to prevent us inlining inside a lambda without some
-      -- good reason.  See the notes on int_cxt in preInlineUnconditionally
+      -- InsideLam:
 
 --    is_unlifted = isUnliftedType (idType bndr)
-    is_demanded = isStrUsedDmd (idDemandInfo bndr)
-    occ_info    = idOccInfo old_bndr
-    unfolding   = idUnfolding bndr
-    uf_opts     = seUnfoldingOpts env
-    phase       = sePhase env
-    active      = isActive phase (idInlineActivation bndr)
-        -- See Note [pre/postInlineUnconditionally in gentle mode]
 
 {- Note [Inline small things to avoid creating a thunk]
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1669,23 +1696,24 @@ where GHC.Driver.CmdLine.$wprocessArgs allocated hugely more.
 Note [Post-inline for single-use things]
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 If we have
-
    let x = rhs in ...x...
-
 and `x` is used exactly once, and not inside a lambda, then we will usually
 preInlineUnconditinally. But we can still get this situation in
 postInlineUnconditionally:
-
   case K rhs of K x -> ...x....
-
 Here we'll use `simplAuxBind` to bind `x` to (the already-simplified) `rhs`;
 and `x` is used exactly once.  It's beneficial to inline right away; otherwise
 we risk creating
-
    let x = rhs in ...x...
+which will take another iteration of the Simplifier to eliminate.
 
-which will take another iteration of the Simplifier to eliminate.  We do this in
-two places
+A similar, but less frequent, case is
+    let f = \x.blah in ...(\y. ...(f e)...) ...
+Again `preInlineUnconditionally will usually inline `f`, but it can arise
+via `simplAuxBind` if we have something like
+    (\f \y. ...(f e)..) (\x.blah)
+
+We do unconditional post-inlining in two places:
 
 1. In the full `postInlineUnconditionally` look for the special case
    of "one occurrence, not under a lambda", and inline unconditionally then.
@@ -1714,24 +1742,20 @@ Alas!
 
 Note [Top level and postInlineUnconditionally]
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-We don't do postInlineUnconditionally for top-level things (even for
-ones that are trivial):
+We must take care when considering postInlineUnconditionally for top-level things
 
-  * Doing so will inline top-level error expressions that have been
-    carefully floated out by FloatOut.  More generally, it might
-    replace static allocation with dynamic.
+  * Don't inline top-level error expressions that have been carefully floated
+    out by FloatOut.  See Note [Top-level bottoming Ids].
 
-  * Even for trivial expressions there's a problem.  Consider
+  * Even for trivial expressions we need to take care: we must not
+    postInlineUnconditionally a top-level ManyOccs binder, even if its
+    RHS is trivial. Consider
       {-# RULE "foo" forall (xs::[T]). reverse xs = ruggle xs #-}
       blah xs = reverse xs
       ruggle = sort
-    In one simplifier pass we might fire the rule, getting
+    We must not postInlineUnconditionally `ruggle`, because in the same
+    simplifier pass we might fire the rule, getting
       blah xs = ruggle xs
-    but in *that* simplifier pass we must not do postInlineUnconditionally
-    on 'ruggle' because then we'll have an unbound occurrence of 'ruggle'
-
-    If the rhs is trivial it'll be inlined by callSiteInline, and then
-    the binding will be dead and discarded by the next use of OccurAnal
 
   * There is less point, because the main goal is to get rid of local
     bindings used in multiple case branches.


=====================================
compiler/GHC/Core/Utils.hs
=====================================
@@ -397,10 +397,12 @@ mkTicks :: [CoreTickish] -> CoreExpr -> CoreExpr
 mkTicks ticks expr = foldr mkTick expr ticks
 
 isSaturatedConApp :: CoreExpr -> Bool
+-- Also includes literals
 isSaturatedConApp e = go e []
   where go (App f a) as = go f (a:as)
         go (Var fun) args
            = isConLikeId fun && idArity fun == valArgCount args
+        go (Lit {})   _  = True
         go (Cast f _) as = go f as
         go _ _ = False
 


=====================================
libraries/ghc-internal/src/GHC/Internal/Base.hs
=====================================
@@ -1809,7 +1809,12 @@ build g = g (:) []
 
 augment :: forall a. (forall b. (a->b->b) -> b -> b) -> [a] -> [a]
 {-# INLINE [1] augment #-}
-augment g xs = g (:) xs
+-- Give it one argument so that it inlines with one arg
+-- But (crucially) the body is a lambda so that `g` is visibly applied
+-- to two args, and hence we know that in a call
+--      augment (\c n. blah)
+-- both c and n are OneShot
+augment g = \xs -> g (:) xs
 
 {-# RULES
 "fold/build"    forall k z (g::forall b. (a->b->b) -> b -> b) .
@@ -1975,7 +1980,7 @@ The rules for map work like this.
 "++/literal_utf8" forall x. (++) (unpackCStringUtf8# x) = unpackAppendCStringUtf8# x #-}
 
 {-# RULES
-"++"    [~1] forall xs ys. xs ++ ys = augment (\c n -> foldr c n xs) ys
+"++"    [~1] forall xs. (++) xs = augment (\c n -> foldr c n xs)
   #-}
 
 


=====================================
libraries/ghc-internal/src/GHC/Internal/Float.hs
=====================================
@@ -5,6 +5,7 @@
            , MagicHash
            , UnboxedTuples
            , UnliftedFFITypes
+           , TypeApplications
   #-}
 {-# LANGUAGE CApiFFI #-}
 -- We believe we could deorphan this module, by moving lots of things
@@ -1696,6 +1697,16 @@ showSignedFloat showPos p x
        = showParen (p > 6) (showChar '-' . showPos (-x))
    | otherwise = showPos x
 
+
+-- Speicialise showSignedFloat for (a) the type and (b) the argument function
+-- The particularly targets are the calls in `instance Show Float` and
+--     `instance Show Double`
+-- Specialising for both (a) and (b) is obviously more efficient; and if you
+-- don't you find that the `x` argument is strict, but boxed, and that can cause
+-- functions calling showSignedFloat to have box their argument.
+{-# SPECIALISE showSignedFloat @Float  showFloat #-}
+{-# SPECIALISE showSignedFloat @Double showFloat #-}
+
 {-
 We need to prevent over/underflow of the exponent in encodeFloat when
 called from scaleFloat, hence we clamp the scaling parameter.


=====================================
testsuite/tests/simplCore/should_run/simplrun009.hs
=====================================
@@ -6,7 +6,10 @@
 -- It produces a nested unfold that should look something
 -- like the code below.  Note the 'lvl1_shW'.  It is BAD
 -- if this is a lambda instead; you get a lot more allocation
--- See Note [Saving allocation] in GHC.Core.Opt.SetLevels
+--
+-- LATER (2025): But in the end it seems better NOT to float lambdas,
+-- unless they go to top level.
+-- See (SW2) in Note [Saving work] in GHC.Core.Opt.SetLevels
 
 
 {-



View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/536aa79dddeb1627340a50809381b55...

-- 
View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/536aa79dddeb1627340a50809381b55...
You're receiving this email because of your account on gitlab.haskell.org.

    

Simon Peyton Jones (＠simonpj)

tags

participants (1)