Re: [GHC] #12808: For primitive (Addr#) operations, Loop Invariant Code Flow not lifted outside the loop...

6 Nov 2016


      #12808: For primitive (Addr#) operations, Loop Invariant Code Flow not lifted
outside the loop...
-------------------------------------+-------------------------------------
        Reporter:  GordonBGood       |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  8.0.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by GordonBGood:

@@ -273,2 +273,4 @@
-         cmpq    %r14, %rbx
-         jbe     .LBB35_1
+         cmpq    %r14, %rbx              # rbx clobbered here, but old
+ value
+         jbe     .LBB35_1                # never used again until replaced
+ after loop

New description:

 '''Background:'''  I've been intrigued investigating whether GHC can
 produce code "as fast as Cee (C/C++/Rust/etc.)" by-any-means-possible, and
 have been using the very tight inner composite culling loops (purely
 integer operations) of a basic Sieve of Eratosthenes implementation as a
 test vehicle.

 '''Synopsis:'''  This is a follow-on of the work leading to finding the
 efficiency problem described in ticket #12798, but involves pushing the
 speed even further as per the method described for "primesieve as per
 [http://primesieve.org/] in the "Highly optimized inner loop" section.

 '''Description of test code:'''   Essentially, this method involves
 extreme loop unrolling with very imperative code although coded
 functionally; in the case of the following code it means that, recognizing
 that for all odd primes (which they all are other than two), and that all
 word sizes are of an even number of bits, there will be a repeating
 pattern of composite number culls that repeats every word size number of
 bits.  Thus for a word size of one eight-bit byte, we can unroll to eight
 composite culls in the body of one loop, with loops cases for the primes
 modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions (b0..b7)
 meaning there are four times eight is 32 loop cases.  When there are no
 longer a full eight culls left, the culling reverts to conventional
 single-cull-per-loop as per the test program of ticket #12798.

 To do this using GHC we need pointer arithmetic, and the only way to
 implement pointer arithmetic in GHC is to use the Addr# primitive.
 GHC/Haskell has one other slight overhead over Cee languages in that we
 need to move the culling array to a pinned array to avoid having it moved
 while the culling is going on and then move it back when finished but this
 takes a negligible amount of time (one percent or so) as compared to the
 culling.  As usual for test programs, the culling operations are repeated
 in a loop for a number of times to give more accurate timing not
 influenced by execution not related to the culling.  All of this is
 included in the following code (truncated as to loop coses for inclusion
 here):
 {{{#!hs
 -- EfficiencyBug.hs
 -- showing that there is a register loop invariant bug in generation of
 assembler code...

 -- LLVM shows the bug clearer since the code is generally a little
 faster...
 {-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
 {-# OPTIONS_GHC -O2 -rtsopts -keep-s-files #-} -- or -O2 -fllvm

 import Data.Word
 import Data.Bits
 import Data.Array.ST (runSTUArray)
 import Data.Array.Base
 import GHC.ST ( ST(..) )
 import GHC.Exts

 numLOOPS = 10000 :: Int

 -- Uses a very simple Sieve of Eratosthenes for fixed 2 ^ 18 range (so one
 L1 cache size) to prove it.
 twos :: UArray Int Word32
 twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

 soep1 :: () -> [Word32]
 soep1() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
  bufb = runSTUArray $ do
   let bfBts = (256 * 1024) `div` 2 -- to 2^18 + 2 is 128 KBits = 16 KBytes
   bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool)
   cullb bf
  cullb bf@(STUArray l u n marr#) = ST $ \s0# ->
   case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) ->
   let loop t mr# s0# = -- cull a number of times to test timing
         if t <= 0 then (# s0#, STUArray l u n mr# #) else
         case getSizeofMutableByteArray# mr# s0# of { (# s1#, n# #) ->
         case newPinnedByteArray# n# s1#         of { (# s2#, marr'# #) ->
         case copyMutableByteArray# mr# 0# marr'# 0# n# s2# of { s3# ->
         case unsafeFreezeByteArray# marr'# s3#  of { (# s4#, arr# #) -> --
 must do this
         case byteArrayContents# arr#            of { adr# -> -- to obtain
 the addr# here
         let cullp i@(I# i#) sp# =
               let !p@(I# p#) = i + i + 3 in
               let !s@(I# s#) = (p * p - 3) `div` 2 in
               if s >= n then
                 case copyMutableByteArray# marr'# 0# mr# 0# n# sp# of
                   so# -> (# so#, mr# #) else
                 let !(UArray _ _ _ tarr#) = twos in
                 case readWord64Array# marr# (i# `uncheckedIShiftRL#` 6#)
 sp# of { (# sp0#, v0# #) ->
                 case (v0# `and#` ((int2Word# 1#) `uncheckedShiftL#` (i#
 `andI#` 63#))) `eqWord#` (int2Word# 0#) of
                   0# -> cullp (i + 1) sp0# -- not prime
                   _ -> -- is prime
                     -- most program execution time spent in the following
 tight loops.
                     -- the following code implments extream loop
 unrolling...
                     let !pi@(I# pi#) = fromIntegral p in
                     let !sw@(I# sw#) = s `shiftR` 3 in let !sb@(I# sb#) =
 s .&. 7 in
                     let p1 = sb + pi in let !(I# r1#) = p1 `shiftR` 3 in
                     let p2 = p1 + pi in let !(I# r2#) = p2 `shiftR` 3 in
                     let p3 = p2 + pi in let !(I# r3#) = p3 `shiftR` 3 in
                     let p4 = p3 + pi in let !(I# r4#) = p4 `shiftR` 3 in
                     let p5 = p4 + pi in let !(I# r5#) = p5 `shiftR` 3 in
                     let p6 = p5 + pi in let !(I# r6#) = p6 `shiftR` 3 in
                     let p7 = p6 + pi in let !(I# r7#) = p7 `shiftR` 3 in
                     let !lmt@(I# lmt#) = (fromIntegral n `shiftR` 3) - p7
 in
                     let !lmt1# = plusAddr# adr# lmt# in
                     let !strt# = plusAddr# adr# sw# in
                     let !(I# n#) = n in
                     let (# !so#, !sco# #) = case ((((p - 1) `div` 2) .&.
 3) `shiftL` 3) + sb of {
                       0 ->
                         let cull c# sp# =
                               case c# `ltAddr#` lmt1# of
                                 0# -> (# c#, sp# #)
                                 _ ->
                                   case readWord8OffAddr# c# 0# sp# of { (#
 sp0#, v0# #) ->
                                   case writeWord8OffAddr# c# 0# (v0# `or#`
 (int2Word# 1#)) sp0# of { sp1# ->
                                   case readWord8OffAddr# c# r1# sp1# of {
 (# sp2#, v1# #) ->
                                   case writeWord8OffAddr# c# r1# (v1#
 `or#` (int2Word# 2#)) sp2# of { sp3# ->
                                   case readWord8OffAddr# c# r2# sp3# of {
 (# sp4#, v2# #) ->
                                   case writeWord8OffAddr# c# r2# (v2#
 `or#` (int2Word# 4#)) sp4# of { sp5# ->
                                   case readWord8OffAddr# c# r3# sp5# of {
 (# sp6#, v3# #) ->
                                   case writeWord8OffAddr# c# r3# (v3#
 `or#` (int2Word# 8#)) sp6# of { sp7# ->
                                   case readWord8OffAddr# c# r4# sp7# of {
 (# sp8#, v4# #) ->
                                   case writeWord8OffAddr# c# r4# (v4#
 `or#` (int2Word# 16#)) sp8# of { sp9# ->
                                   case readWord8OffAddr# c# r5# sp9# of {
 (# sp10#, v5# #) ->
                                   case writeWord8OffAddr# c# r5# (v5#
 `or#` (int2Word# 32#)) sp10# of { sp11# ->
                                   case readWord8OffAddr# c# r6# sp11# of {
 (# sp12#, v6# #) ->
                                   case writeWord8OffAddr# c# r6# (v6#
 `or#` (int2Word# 64#)) sp12# of { sp13# ->
                                   case readWord8OffAddr# c# r7# sp13# of {
 (# sp14#, v7# #) ->
                                   case writeWord8OffAddr# c# r7# (v7#
 `or#` (int2Word# 128#)) sp14# of { sp15# ->
                                   cull (plusAddr# c# pi#) sp15#
 }}}}}}}}}}}}}}}} in
                         cull strt# sp0#;
                       1 ->
                         let cull c# sp# =
                               case c# `ltAddr#` lmt1# of
                                 0# -> (# c#, sp# #)
                                 _ ->
                                   case readWord8OffAddr# c# 0# sp# of { (#
 sp0#, v0# #) ->
                                   case writeWord8OffAddr# c# 0# (v0# `or#`
 (int2Word# 2#)) sp0# of { sp1# ->
                                   case readWord8OffAddr# c# r1# sp1# of {
 (# sp2#, v1# #) ->
                                   case writeWord8OffAddr# c# r1# (v1#
 `or#` (int2Word# 4#)) sp2# of { sp3# ->
                                   case readWord8OffAddr# c# r2# sp3# of {
 (# sp4#, v2# #) ->
                                   case writeWord8OffAddr# c# r2# (v2#
 `or#` (int2Word# 8#)) sp4# of { sp5# ->
                                   case readWord8OffAddr# c# r3# sp5# of {
 (# sp6#, v3# #) ->
                                   case writeWord8OffAddr# c# r3# (v3#
 `or#` (int2Word# 16#)) sp6# of { sp7# ->
                                   case readWord8OffAddr# c# r4# sp7# of {
 (# sp8#, v4# #) ->
                                   case writeWord8OffAddr# c# r4# (v4#
 `or#` (int2Word# 32#)) sp8# of { sp9# ->
                                   case readWord8OffAddr# c# r5# sp9# of {
 (# sp10#, v5# #) ->
                                   case writeWord8OffAddr# c# r5# (v5#
 `or#` (int2Word# 64#)) sp10# of { sp11# ->
                                   case readWord8OffAddr# c# r6# sp11# of {
 (# sp12#, v6# #) ->
                                   case writeWord8OffAddr# c# r6# (v6#
 `or#` (int2Word# 128#)) sp12# of { sp13# ->
                                   case readWord8OffAddr# c# r7# sp13# of {
 (# sp14#, v7# #) ->
                                   case writeWord8OffAddr# c# r7# (v7#
 `or#` (int2Word# 1#)) sp14# of { sp15# ->
                                   cull (plusAddr# c# pi#) sp15#
 }}}}}}}}}}}}}}}} in
                         cull strt# sp0#;
                       -- and so on for 30 more cases...
                       _ -> (# strt#, sp0# #) {- normally never taken case,
 all cases covered -} } in
                     let !ns# = ((minusAddr# so# adr#) `uncheckedIShiftL#`
 3#) +# sb# in
                     -- extreme loop unrolling ends here; remaining primes
 culled conventionally...
                     let cull j# sc# = -- very tight inner loop
                           case j# <# n# of
                             0# -> cullp (i + 1) sc#
                             _ -> let i# = j# `andI#` 31# in
                                  let !sh# = indexWord32Array# tarr# i# in
 -- (1 `shiftL` (j .&. 31)))
                                  let w# = j# `uncheckedIShiftRL#` 5# in
                                  case readWord32Array# marr'# w# sc# of {
 (# sc0#, ov# #) ->
                                  case writeWord32Array# marr'# w# (ov#
 `or#` sh#) sc0# of { sc1# ->
                                  cull (j# +# pi#) sc1# }} in
                     cull ns# sp0# } in
         case cullp 0 s4# of (# sp#, mrp# #) -> loop (t - 1) mrp# sp# }}}}}
 in loop numLOOPS marr# s1# }

 main = print $ length $ soep1()
 }}}
 '''The problem:'''  The problem is in the innermost loop of the cases, for
 which case "0" the following assembly code (using -fllvm) is produced:
 {{{
 seGU_info$def:
 # BB#0:                                 # %cgRL
         cmpq    %r14, 70(%rbx)
         jbe     .LBB35_1
         .align  16, 0x90
 .LBB35_2:                               # %cgRJ
                                         # =>This Inner Loop Header:
 Depth=1
         movq    14(%rbx), %rcx
         movq    22(%rbx), %rdx
         movq    30(%rbx), %rsi
         movq    38(%rbx), %rdi
         movq    46(%rbx), %r8
         movq    54(%rbx), %r9
         movq    62(%rbx), %r10
         movq    6(%rbx), %rax
         addq    %r14, %rax
         orb     $1, (%r14)
         orb     $2, (%rcx,%r14)
         orb     $4, (%rdx,%r14)
         orb     $8, (%rsi,%r14)
         orb     $16, (%rdi,%r14)
         orb     $32, (%r8,%r14)
         orb     $64, (%r9,%r14)
         orb     $-128, (%r10,%r14)
         cmpq    70(%rbx), %rax
         movq    %rax, %r14
         jb      .LBB35_2
         jmp     .LBB35_3
 .LBB35_1:
         movq    %r14, %rax
 .LBB35_3:                               # %cgRK
         movq    (%rbp), %rcx
         movq    %rax, %rbx
         rex64 jmpq      *%rcx           # TAILCALL
 }}}
 One can readily see that the compiler is not lifting the Loop Invariant
 Code Flow as in initializing the registers to outside the inner loop,
 meaning there are many register loads from memory which are not necessary.

 '''Desired results:'''  The desired assembly code is something like the
 following, which is similar to what is produced by Cee (C/C++/Rust/etc.):
 {{{
 seGU_info$def:
 # BB#0:                                 # %cgRL
         movq    14(%rbx), %rcx
         movq    22(%rbx), %rdx
         movq    30(%rbx), %rsi
         movq    38(%rbx), %rdi
         movq    46(%rbx), %r8
         movq    54(%rbx), %r9
         movq    62(%rbx), %r10
         movq    6(%rbx), %rax
         movq    70(%rbx), %rbx
         cmpq    %r14, %rbx              # rbx clobbered here, but old
 value
         jbe     .LBB35_1                # never used again until replaced
 after loop
         .align  16, 0x90
 .LBB35_2:                               # %cgRJ
                                         # =>This Inner Loop Header:
 Depth=1
         orb     $1, (%r14)
         orb     $2, (%rcx,%r14)
         orb     $4, (%rdx,%r14)
         orb     $8, (%rsi,%r14)
         orb     $16, (%rdi,%r14)
         orb     $32, (%r8,%r14)
         orb     $64, (%r9,%r14)
         orb     $-128, (%r10,%r14)
         addq    %rax, %r14
         cmpq    %rbx, %r14
         jb      .LBB35_2
         jmp     .LBB35_3
 .LBB35_1:
         movq    %r14, %rax
 .LBB35_3:                               # %cgRK
         movq    (%rbp), %rcx
         movq    %rax, %rbx              # rbx clobbered here anyway
         rex64 jmpq      *%rcx           # TAILCALL
 }}}
 '''Full testing:'''  The actual unrolled loop code including all the case
 loops is too long to post here, but to verify the result is correct
 (23000) and the performance, the full actual file is attached here.  Due
 to the magic of modern CPU instruction fusion and Out Of Order (OOE)
 execution, the code is not as slow as it would indicate by the number of
 increased instructions, but while it is about twice as fast as when culled
 conventionally (Intel Skylake), it is about half again as slow as Cee. On
 an Intel Sky Lake i5-6500 (running at 3.5 GHz for single threading), this
 takes about one second, about two seconds culled conventionally, but only
 about 0.6 seconds for Rust/LLVM (with the assembly code output essentially
 identical to the "desired" code).

 '''Other back ends and targets:'''  Although the code generated by the
 native NCG has other problems (not moving the loop test to the end of the
 loop to avoid one jump, and not combining the read and modify and store
 instructions into the single available instruction), it exhibits the same
 problem as to not lifting the Loop Invariant Code Flow register
 initialization.

 Although this code is x86_64, the problem also applies to x86 code even
 though the x86 architecture doesn't have enough registers to do this in
 one loop and needs to be split into two loops culling only four composites
 per loop, but there still is a significant gain in speed.  Although not
 tested, it probably also applies to other targets such as ARM (which has
 many general purpose registers).

 '''Conclusion:'''  The use of Addr# primitives is probably not a frequent
 use case, but as shown here that when one needs their use, they should be
 efficient.

 I considered that GHC may intentionally limit the performance of these
 unsafe primitives to limit their use unless absolutely necessary as in
 marshalling, something as C# does for the use of unsafe pointers, but
 surely GHC would not do that as the target programmers are different.

 '''If this and ticket #12798 were fixed, for this use case the GHC code
 would be within a percent or two of the performance of Cee.'''

--

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:2
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler