
#12808: For primitive (Addr#) operations, Loop Invariant Code Flow not lifted outside the loop... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by GordonBGood: @@ -273,2 +273,4 @@ - cmpq %r14, %rbx - jbe .LBB35_1 + cmpq %r14, %rbx # rbx clobbered here, but old + value + jbe .LBB35_1 # never used again until replaced + after loop New description: '''Background:''' I've been intrigued investigating whether GHC can produce code "as fast as Cee (C/C++/Rust/etc.)" by-any-means-possible, and have been using the very tight inner composite culling loops (purely integer operations) of a basic Sieve of Eratosthenes implementation as a test vehicle. '''Synopsis:''' This is a follow-on of the work leading to finding the efficiency problem described in ticket #12798, but involves pushing the speed even further as per the method described for "primesieve as per [http://primesieve.org/] in the "Highly optimized inner loop" section. '''Description of test code:''' Essentially, this method involves extreme loop unrolling with very imperative code although coded functionally; in the case of the following code it means that, recognizing that for all odd primes (which they all are other than two), and that all word sizes are of an even number of bits, there will be a repeating pattern of composite number culls that repeats every word size number of bits. Thus for a word size of one eight-bit byte, we can unroll to eight composite culls in the body of one loop, with loops cases for the primes modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions (b0..b7) meaning there are four times eight is 32 loop cases. When there are no longer a full eight culls left, the culling reverts to conventional single-cull-per-loop as per the test program of ticket #12798. To do this using GHC we need pointer arithmetic, and the only way to implement pointer arithmetic in GHC is to use the Addr# primitive. GHC/Haskell has one other slight overhead over Cee languages in that we need to move the culling array to a pinned array to avoid having it moved while the culling is going on and then move it back when finished but this takes a negligible amount of time (one percent or so) as compared to the culling. As usual for test programs, the culling operations are repeated in a loop for a number of times to give more accurate timing not influenced by execution not related to the culling. All of this is included in the following code (truncated as to loop coses for inclusion here): {{{#!hs -- EfficiencyBug.hs -- showing that there is a register loop invariant bug in generation of assembler code... -- LLVM shows the bug clearer since the code is generally a little faster... {-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-} {-# OPTIONS_GHC -O2 -rtsopts -keep-s-files #-} -- or -O2 -fllvm import Data.Word import Data.Bits import Data.Array.ST (runSTUArray) import Data.Array.Base import GHC.ST ( ST(..) ) import GHC.Exts numLOOPS = 10000 :: Int -- Uses a very simple Sieve of Eratosthenes for fixed 2 ^ 18 range (so one L1 cache size) to prove it. twos :: UArray Int Word32 twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]] soep1 :: () -> [Word32] soep1() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where bufb = runSTUArray $ do let bfBts = (256 * 1024) `div` 2 -- to 2^18 + 2 is 128 KBits = 16 KBytes bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool) cullb bf cullb bf@(STUArray l u n marr#) = ST $ \s0# -> case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) -> let loop t mr# s0# = -- cull a number of times to test timing if t <= 0 then (# s0#, STUArray l u n mr# #) else case getSizeofMutableByteArray# mr# s0# of { (# s1#, n# #) -> case newPinnedByteArray# n# s1# of { (# s2#, marr'# #) -> case copyMutableByteArray# mr# 0# marr'# 0# n# s2# of { s3# -> case unsafeFreezeByteArray# marr'# s3# of { (# s4#, arr# #) -> -- must do this case byteArrayContents# arr# of { adr# -> -- to obtain the addr# here let cullp i@(I# i#) sp# = let !p@(I# p#) = i + i + 3 in let !s@(I# s#) = (p * p - 3) `div` 2 in if s >= n then case copyMutableByteArray# marr'# 0# mr# 0# n# sp# of so# -> (# so#, mr# #) else let !(UArray _ _ _ tarr#) = twos in case readWord64Array# marr# (i# `uncheckedIShiftRL#` 6#) sp# of { (# sp0#, v0# #) -> case (v0# `and#` ((int2Word# 1#) `uncheckedShiftL#` (i# `andI#` 63#))) `eqWord#` (int2Word# 0#) of 0# -> cullp (i + 1) sp0# -- not prime _ -> -- is prime -- most program execution time spent in the following tight loops. -- the following code implments extream loop unrolling... let !pi@(I# pi#) = fromIntegral p in let !sw@(I# sw#) = s `shiftR` 3 in let !sb@(I# sb#) = s .&. 7 in let p1 = sb + pi in let !(I# r1#) = p1 `shiftR` 3 in let p2 = p1 + pi in let !(I# r2#) = p2 `shiftR` 3 in let p3 = p2 + pi in let !(I# r3#) = p3 `shiftR` 3 in let p4 = p3 + pi in let !(I# r4#) = p4 `shiftR` 3 in let p5 = p4 + pi in let !(I# r5#) = p5 `shiftR` 3 in let p6 = p5 + pi in let !(I# r6#) = p6 `shiftR` 3 in let p7 = p6 + pi in let !(I# r7#) = p7 `shiftR` 3 in let !lmt@(I# lmt#) = (fromIntegral n `shiftR` 3) - p7 in let !lmt1# = plusAddr# adr# lmt# in let !strt# = plusAddr# adr# sw# in let !(I# n#) = n in let (# !so#, !sco# #) = case ((((p - 1) `div` 2) .&. 3) `shiftL` 3) + sb of { 0 -> let cull c# sp# = case c# `ltAddr#` lmt1# of 0# -> (# c#, sp# #) _ -> case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) -> case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 1#)) sp0# of { sp1# -> case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) -> case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 2#)) sp2# of { sp3# -> case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) -> case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 4#)) sp4# of { sp5# -> case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) -> case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 8#)) sp6# of { sp7# -> case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) -> case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 16#)) sp8# of { sp9# -> case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) -> case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 32#)) sp10# of { sp11# -> case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) -> case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 64#)) sp12# of { sp13# -> case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) -> case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 128#)) sp14# of { sp15# -> cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}} in cull strt# sp0#; 1 -> let cull c# sp# = case c# `ltAddr#` lmt1# of 0# -> (# c#, sp# #) _ -> case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) -> case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 2#)) sp0# of { sp1# -> case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) -> case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 4#)) sp2# of { sp3# -> case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) -> case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 8#)) sp4# of { sp5# -> case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) -> case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 16#)) sp6# of { sp7# -> case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) -> case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 32#)) sp8# of { sp9# -> case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) -> case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 64#)) sp10# of { sp11# -> case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) -> case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 128#)) sp12# of { sp13# -> case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) -> case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 1#)) sp14# of { sp15# -> cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}} in cull strt# sp0#; -- and so on for 30 more cases... _ -> (# strt#, sp0# #) {- normally never taken case, all cases covered -} } in let !ns# = ((minusAddr# so# adr#) `uncheckedIShiftL#` 3#) +# sb# in -- extreme loop unrolling ends here; remaining primes culled conventionally... let cull j# sc# = -- very tight inner loop case j# <# n# of 0# -> cullp (i + 1) sc# _ -> let i# = j# `andI#` 31# in let !sh# = indexWord32Array# tarr# i# in -- (1 `shiftL` (j .&. 31))) let w# = j# `uncheckedIShiftRL#` 5# in case readWord32Array# marr'# w# sc# of { (# sc0#, ov# #) -> case writeWord32Array# marr'# w# (ov# `or#` sh#) sc0# of { sc1# -> cull (j# +# pi#) sc1# }} in cull ns# sp0# } in case cullp 0 s4# of (# sp#, mrp# #) -> loop (t - 1) mrp# sp# }}}}} in loop numLOOPS marr# s1# } main = print $ length $ soep1() }}} '''The problem:''' The problem is in the innermost loop of the cases, for which case "0" the following assembly code (using -fllvm) is produced: {{{ seGU_info$def: # BB#0: # %cgRL cmpq %r14, 70(%rbx) jbe .LBB35_1 .align 16, 0x90 .LBB35_2: # %cgRJ # =>This Inner Loop Header: Depth=1 movq 14(%rbx), %rcx movq 22(%rbx), %rdx movq 30(%rbx), %rsi movq 38(%rbx), %rdi movq 46(%rbx), %r8 movq 54(%rbx), %r9 movq 62(%rbx), %r10 movq 6(%rbx), %rax addq %r14, %rax orb $1, (%r14) orb $2, (%rcx,%r14) orb $4, (%rdx,%r14) orb $8, (%rsi,%r14) orb $16, (%rdi,%r14) orb $32, (%r8,%r14) orb $64, (%r9,%r14) orb $-128, (%r10,%r14) cmpq 70(%rbx), %rax movq %rax, %r14 jb .LBB35_2 jmp .LBB35_3 .LBB35_1: movq %r14, %rax .LBB35_3: # %cgRK movq (%rbp), %rcx movq %rax, %rbx rex64 jmpq *%rcx # TAILCALL }}} One can readily see that the compiler is not lifting the Loop Invariant Code Flow as in initializing the registers to outside the inner loop, meaning there are many register loads from memory which are not necessary. '''Desired results:''' The desired assembly code is something like the following, which is similar to what is produced by Cee (C/C++/Rust/etc.): {{{ seGU_info$def: # BB#0: # %cgRL movq 14(%rbx), %rcx movq 22(%rbx), %rdx movq 30(%rbx), %rsi movq 38(%rbx), %rdi movq 46(%rbx), %r8 movq 54(%rbx), %r9 movq 62(%rbx), %r10 movq 6(%rbx), %rax movq 70(%rbx), %rbx cmpq %r14, %rbx # rbx clobbered here, but old value jbe .LBB35_1 # never used again until replaced after loop .align 16, 0x90 .LBB35_2: # %cgRJ # =>This Inner Loop Header: Depth=1 orb $1, (%r14) orb $2, (%rcx,%r14) orb $4, (%rdx,%r14) orb $8, (%rsi,%r14) orb $16, (%rdi,%r14) orb $32, (%r8,%r14) orb $64, (%r9,%r14) orb $-128, (%r10,%r14) addq %rax, %r14 cmpq %rbx, %r14 jb .LBB35_2 jmp .LBB35_3 .LBB35_1: movq %r14, %rax .LBB35_3: # %cgRK movq (%rbp), %rcx movq %rax, %rbx # rbx clobbered here anyway rex64 jmpq *%rcx # TAILCALL }}} '''Full testing:''' The actual unrolled loop code including all the case loops is too long to post here, but to verify the result is correct (23000) and the performance, the full actual file is attached here. Due to the magic of modern CPU instruction fusion and Out Of Order (OOE) execution, the code is not as slow as it would indicate by the number of increased instructions, but while it is about twice as fast as when culled conventionally (Intel Skylake), it is about half again as slow as Cee. On an Intel Sky Lake i5-6500 (running at 3.5 GHz for single threading), this takes about one second, about two seconds culled conventionally, but only about 0.6 seconds for Rust/LLVM (with the assembly code output essentially identical to the "desired" code). '''Other back ends and targets:''' Although the code generated by the native NCG has other problems (not moving the loop test to the end of the loop to avoid one jump, and not combining the read and modify and store instructions into the single available instruction), it exhibits the same problem as to not lifting the Loop Invariant Code Flow register initialization. Although this code is x86_64, the problem also applies to x86 code even though the x86 architecture doesn't have enough registers to do this in one loop and needs to be split into two loops culling only four composites per loop, but there still is a significant gain in speed. Although not tested, it probably also applies to other targets such as ARM (which has many general purpose registers). '''Conclusion:''' The use of Addr# primitives is probably not a frequent use case, but as shown here that when one needs their use, they should be efficient. I considered that GHC may intentionally limit the performance of these unsafe primitives to limit their use unless absolutely necessary as in marshalling, something as C# does for the use of unsafe pointers, but surely GHC would not do that as the target programmers are different. '''If this and ticket #12798 were fixed, for this use case the GHC code would be within a percent or two of the performance of Cee.''' -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler