Re: [GHC] #15560: Full laziness destroys opportunities for join points

24 Aug 2018

      #15560: Full laziness destroys opportunities for join points
-------------------------------------+-------------------------------------
        Reporter:  AndreasK          |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.6.1
       Component:  Compiler          |              Version:  8.4.3
  (CodeGen)                          |
      Resolution:                    |             Keywords:  JoinPoints
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #14287            |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by AndreasK):

 If j becomes a top level binding we use the general calling convention.
 Which at the assembly level is still a jump as you said.
 However there are a subtle differences between jumping to top level
 bindings versus jumping into a basic block which can have a major
 performance impact.

 Things I can immediatly think of are:

 * If we jump a top level symbol we can't place the jump target immediately
 after the caller.
   This means we:
   * Can't eliminate one of the jump instructions, so they take up resource
 for branch prediction and need to be executed by the CPU.
   * The code won't be placed sequentially in memory leading to worse cache
 utilization.
 * Top level bindings require an additional info table compared to a
 regular jump target. This means more code size which is never a good
 thing.
 * Being a top level function that uses the stack `j` now performs a stack
 check. For very small functions this can be a lot of overhead.

 It's quite possible that in the general case more inlining is offsetting
 this cost, but in some cases this makes a major difference.

 For example the program below has ~7% speedup when disabling full
 laziness(780 vs 730ms).

 {{{
 #!haskell
 --Simpler core to read without worker/wrapper
 {-# OPTIONS_GHC -fno-full-laziness -fno-worker-wrapper #-}
 {-# LANGUAGE MagicHash, BangPatterns #-}

 module Main where

 import System.Environment
 import GHC.Prim

 data T = A | B | C

 -- If we inline the functions case of known constructors kicks in.
 -- Which is good! But means j becomes small enough to be inlined
 -- and won't become an join point. So for this example we don't
 -- want that.
 {-# NOINLINE n #-}
 {-# NOINLINE f #-}
 n :: T -> T
 n A = B
 n B = C
 n _ = A

 toInt :: T -> Int
 toInt A = 1
 toInt B = 2
 toInt C = 3

 f :: Int -> T -> T -> T
 f sel x y =
     -- function large enough to avoid being simply inlined
     let j z = n . n . n . n . n . n $ z
     in case sel of
         -- j is always tailcalled
         0   -> j x
         _   -> j y

 main = do
     print $ sum . map toInt . map (\n -> f n A B) $ [0..50000000]
 }}}

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15560#comment:5
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler