Re: [GHC] #12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop...

10 Dec 2016

      #12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code
Flow not lifted outside the loop...
-------------------------------------+-------------------------------------
        Reporter:  GordonBGood       |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  8.0.1
      Resolution:                    |             Keywords:  JoinPoints
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by GordonBGood):

 Replying to [comment:14 simonpj]:
...
...
...
In SLOW, `nxtc` is represented by a heap-allocated closure, whereas in
 FAST `nxtc` is a join point, and hence not allocated at all (you can see
 that from the `let-no-escape` in the STG).  See our paper
 [https://www.microsoft.com/en-us/research/publication/compiling-without-
 continuations/ Compiling without continuations], and [wiki:SequentCore].
...
Notice that in SLOW, the call to `nxtc s` is ''followed by'' a call to
 `nxtp (p+1)`.  But in FAST we move that call right into `nxtc` itself, in
Yes, I saw that.

 the return branch of the `if`.  That's what makes `nxtc` into a join
 point.

 I had wondered if the work on join points might help here...
...
(None of this has anything to do with non-strictness, incidentally.)
...
This is a rather non-trivial transformation. You clearly think it's a
...
Happily, though, our new Core-with-join-point (described in the paper)
 should catch this nicely.  If we start with SLOW, after a bit we'll get
I just wondered if strictness might be a clue, as the FAST version
 consumes very low heap, whereas the SLOW version takes an amount of heap
 which is about the calculated amount for numLOOPS times the number of
 primes culled...

 pretty obvious optimisation, but it doesn't look obvious to me.

 No, I didn't think it was trivial, just that the optimisation gets
 triggered in the one case; I couldn't see why it couldn't be extended to
 the other...

 this
...
{{{
let nxtc c s = if c > top
               then (# s,c #)
               else case ... of (# s',p #) -> nxtc (c+p) s' }
in case nxtc ss s of
     (# s', r #) -> nxtp (p + 1) s' }
}}}
Now if we do float-in, to move the `let nxtc` in to the scruintee of the
 case, it becomes a join point, and join-point analysis should find it.
 After that, the transformations in the paper will turn it into FAST.
The example in comment:10 looks similar:
{{{
  case doit adr# sp# of sd# -> cullp (p + 2) sd# in cullp 1 s4# }}}}}
}}}
Here again, `doit` will become a join point if we float it in.  The
 example in the Descrption is too big for me to analyse.
Don't worry about the big example in the description as if the small
 example in comment 10 and this latest example in comment 12 get fixed, I'm
 pretty sure it will take core of that larger case too.
...
I've cc'd Luke Maurer who is implementing Core with join points; this
 looks like another good example.  (cf #12781)
That is encouraging news the the Join Point Analysis should catch this.
 Thanks for looking at this, as I think it important in many cases beyond
 this.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:15
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler