[GHC] #12808: For primitive (Addr#) operations, Loop Invariant Code Flow not lifted outside the loop...

#12808: For primitive (Addr#) operations, Loop Invariant Code Flow not lifted
outside the loop...
-------------------------------------+-------------------------------------
            Reporter:  GordonBGood   |              Owner:
                Type:  bug           |             Status:  new
            Priority:  normal        |          Milestone:  8.2.1
           Component:  Compiler      |            Version:  8.0.1
            Keywords:                |   Operating System:  Unknown/Multiple
        Architecture:  Unknown/Multiple  |  Type of failure:  Runtime performance bug
           Test Case:                |         Blocked By:
            Blocking:                |    Related Tickets:
 Differential Rev(s):                |          Wiki Page:
-------------------------------------+-------------------------------------

'''Background:''' I've been intrigued by investigating whether GHC can
produce code "as fast as Cee (C/C++/Rust/etc.)" by-any-means-possible, and
have been using the very tight inner composite-culling loops (purely
integer operations) of a basic Sieve of Eratosthenes implementation as a
test vehicle.

'''Synopsis:''' This is a follow-on of the work leading to finding the
efficiency problem described in ticket #12798, but involves pushing the
speed even further as per the method described for "primesieve as per
[http://primesieve.org/] in the "Highly optimized inner loop" section.

'''Description of test code:''' Essentially, this method involves extreme
loop unrolling with very imperative code, although coded functionally; in
the case of the following code it means recognizing that, since all primes
other than two are odd and all word sizes are an even number of bits, there
will be a pattern of composite-number culls that repeats every word-size
number of bits. Thus for a word size of one eight-bit byte, we can unroll
to eight composite culls in the body of one loop, with loop cases for the
primes modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions
(b0..b7), meaning there are four times eight = 32 loop cases. When there
are no longer a full eight culls left, the culling reverts to conventional
single-cull-per-loop as per the test program of ticket #12798.
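To make the case selection concrete, here is a small illustrative snippet
(my own addition to this write-up, not part of the attached test program);
the index expression is the same one the test code below uses to dispatch
to one of the 32 unrolled loops:

{{{#!hs
import Data.Bits ((.&.), shiftL)

-- For an odd prime p (so p `mod` 8 is 1, 3, 5 or 7) and a starting bit
-- position s, map p `mod` 8 to 0..3 via ((p - 1) `div` 2) .&. 3, then
-- combine it with the start bit s `mod` 8 to select one of 32 cases:
caseIndex :: Int -> Int -> Int
caseIndex p s = ((((p - 1) `div` 2) .&. 3) `shiftL` 3) + (s .&. 7)

main :: IO ()
main = print [caseIndex p s | p <- [3, 5, 7, 9], s <- [0 .. 7]]
-- prints indices 8..15 (p mod 8 == 3), 16..23 (== 5), 24..31 (== 7)
-- and 0..7 (9 == 1 mod 8)
}}}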
To do this using GHC we need pointer arithmetic, and the only way to
implement pointer arithmetic in GHC is to use the Addr# primitive.
GHC/Haskell has one other slight overhead over Cee languages in that we
need to move the culling array to a pinned array to avoid having it moved
while the culling is going on, and then move it back when finished, but
this takes a negligible amount of time (one percent or so) compared to the
culling.

As usual for test programs, the culling operations are repeated in a loop
a number of times to give more accurate timing, not influenced by execution
unrelated to the culling. All of this is included in the following code
(truncated as to loop cases for inclusion here):

{{{#!hs
-- EfficiencyBug.hs
-- showing that there is a register loop invariant bug in generation of
-- assembler code...
-- LLVM shows the bug clearer since the code is generally a little faster...

{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O2 -rtsopts -keep-s-files #-} -- or -O2 -fllvm

import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )
import GHC.Exts

numLOOPS = 10000 :: Int

-- Uses a very simple Sieve of Eratosthenes for fixed 2 ^ 18 range (so one
-- L1 cache size) to prove it.

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

soep1 :: () -> [Word32]
soep1() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
  bufb = runSTUArray $ do
    let bfBts = (256 * 1024) `div` 2 -- to 2^18 + 2 is 128 KBits = 16 KBytes
    bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool)
    cullb bf
  cullb bf@(STUArray l u n marr#) = ST $ \s0# ->
    case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) ->
    let loop t mr# s0# = -- cull a number of times to test timing
          if t <= 0 then (# s0#, STUArray l u n mr# #) else
          case getSizeofMutableByteArray# mr# s0# of { (# s1#, n# #) ->
          case newPinnedByteArray# n# s1# of { (# s2#, marr'# #) ->
          case copyMutableByteArray# mr# 0# marr'# 0# n# s2# of { s3# ->
          case unsafeFreezeByteArray# marr'# s3# of { (# s4#, arr# #) -> -- must do this
          case byteArrayContents# arr# of { adr# -> -- to obtain the addr# here
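          -- (editorial note) The four primops above are the pin-and-copy
          -- dance described in the text: allocate a pinned byte array, copy
          -- the working array into it, "freeze" it so byteArrayContents#
          -- can be applied, and take its fixed address for the Addr#
          -- arithmetic; cullp copies the results back to the unpinned
          -- array when culling completes.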
          let cullp i@(I# i#) sp# =
                let !p@(I# p#) = i + i + 3 in
                let !s@(I# s#) = (p * p - 3) `div` 2 in
                if s >= n then
                  case copyMutableByteArray# marr'# 0# mr# 0# n# sp# of
                    so# -> (# so#, mr# #)
                else
                  let !(UArray _ _ _ tarr#) = twos in
                  case readWord64Array# marr# (i# `uncheckedIShiftRL#` 6#) sp# of { (# sp0#, v0# #) ->
                  case (v0# `and#` ((int2Word# 1#) `uncheckedShiftL#` (i# `andI#` 63#)))
                         `eqWord#` (int2Word# 0#) of
                    0# -> cullp (i + 1) sp0# -- not prime
                    _ -> -- is prime
                      -- most program execution time spent in the following tight loops.
                      -- the following code implements extreme loop unrolling...
                      let !pi@(I# pi#) = fromIntegral p in
                      let !sw@(I# sw#) = s `shiftR` 3 in
                      let !sb@(I# sb#) = s .&. 7 in
                      let p1 = sb + pi in let !(I# r1#) = p1 `shiftR` 3 in
                      let p2 = p1 + pi in let !(I# r2#) = p2 `shiftR` 3 in
                      let p3 = p2 + pi in let !(I# r3#) = p3 `shiftR` 3 in
                      let p4 = p3 + pi in let !(I# r4#) = p4 `shiftR` 3 in
                      let p5 = p4 + pi in let !(I# r5#) = p5 `shiftR` 3 in
                      let p6 = p5 + pi in let !(I# r6#) = p6 `shiftR` 3 in
                      let p7 = p6 + pi in let !(I# r7#) = p7 `shiftR` 3 in
                      let !lmt@(I# lmt#) = (fromIntegral n `shiftR` 3) - p7 in
                      let !lmt1# = plusAddr# adr# lmt# in
                      let !strt# = plusAddr# adr# sw# in
                      let !(I# n#) = n in
                      let (# !so#, !sco# #) =
                            case ((((p - 1) `div` 2) .&. 3) `shiftL` 3) + sb of {
                              0 ->
                                let cull c# sp# =
                                      case c# `ltAddr#` lmt1# of
                                        0# -> (# c#, sp# #)
                                        _ ->
                                          case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) ->
                                          case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 1#)) sp0# of { sp1# ->
                                          case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) ->
                                          case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 2#)) sp2# of { sp3# ->
                                          case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) ->
                                          case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 4#)) sp4# of { sp5# ->
                                          case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) ->
                                          case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 8#)) sp6# of { sp7# ->
                                          case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) ->
                                          case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 16#)) sp8# of { sp9# ->
                                          case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) ->
                                          case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 32#)) sp10# of { sp11# ->
                                          case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) ->
                                          case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 64#)) sp12# of { sp13# ->
                                          case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) ->
                                          case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 128#)) sp14# of { sp15# ->
                                          cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}}
                                in cull strt# sp0#;
                              1 ->
                                let cull c# sp# =
                                      case c# `ltAddr#` lmt1# of
                                        0# -> (# c#, sp# #)
                                        _ ->
                                          case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) ->
                                          case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 2#)) sp0# of { sp1# ->
                                          case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) ->
                                          case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 4#)) sp2# of { sp3# ->
                                          case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) ->
                                          case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 8#)) sp4# of { sp5# ->
                                          case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) ->
                                          case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 16#)) sp6# of { sp7# ->
                                          case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) ->
                                          case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 32#)) sp8# of { sp9# ->
                                          case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) ->
                                          case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 64#)) sp10# of { sp11# ->
                                          case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) ->
                                          case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 128#)) sp12# of { sp13# ->
                                          case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) ->
                                          case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 1#)) sp14# of { sp15# ->
                                          cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}}
                                in cull strt# sp0#;
                              -- and so on for 30 more cases...
                              _ -> (# strt#, sp0# #) {- normally never taken case, all cases covered -} } in
                      let !ns# = ((minusAddr# so# adr#) `uncheckedIShiftL#` 3#) +# sb# in
                      -- extreme loop unrolling ends here;
                      -- remaining primes culled conventionally...
                      let cull j# sc# = -- very tight inner loop
                            case j# <# n# of
                              0# -> cullp (i + 1) sc#
                              _ ->
                                let i# = j# `andI#` 31# in
                                let !sh# = indexWord32Array# tarr# i# in -- (1 `shiftL` (j .&. 31)))
                                let w# = j# `uncheckedIShiftRL#` 5# in
                                case readWord32Array# marr'# w# sc# of { (# sc0#, ov# #) ->
                                case writeWord32Array# marr'# w# (ov# `or#` sh#) sc0# of { sc1# ->
                                cull (j# +# pi#) sc1# }}
                      in cull ns# sp0# }
          in case cullp 0 s4# of
               (# sp#, mrp# #) -> loop (t - 1) mrp# sp# }}}}}
    in loop numLOOPS marr# s1# }

main = print $ length $ soep1()
}}}
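As a much smaller cross-check, the following self-contained sketch (my own
hypothetical reduction, not the attached test, written against the same GHC
8.0-era primop signatures used above) has the same inner-loop shape, OR-ing
bytes at loop-invariant offsets from an advancing Addr#, and may be handier
for experimenting with the code generators:

{{{#!hs
{-# LANGUAGE MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O2 -keep-s-files #-} -- or add -fllvm
import GHC.Exts
import GHC.IO (IO(..))

-- Hypothetical reduction: r1/r2 are loop invariant; per the report above,
-- their reads end up reloaded inside the generated inner loop.
orLoop :: Int -> Int -> Int -> IO ()
orLoop (I# len#) (I# r1) (I# r2) = IO $ \s0 ->
  case newPinnedByteArray# len# s0 of { (# s1, marr# #) ->
  case unsafeFreezeByteArray# marr# s1 of { (# s2, arr# #) ->
  let adr# = byteArrayContents# arr#
      lmt# = plusAddr# adr# (len# -# r2)
      go c# s =
        case c# `ltAddr#` lmt# of
          0# -> s
          _  ->
            case readWord8OffAddr# c# 0# s of { (# sa, v0 #) ->
            case writeWord8OffAddr# c# 0# (v0 `or#` int2Word# 1#) sa of { sb ->
            case readWord8OffAddr# c# r1 sb of { (# sc, v1 #) ->
            case writeWord8OffAddr# c# r1 (v1 `or#` int2Word# 2#) sc of { sd ->
            case readWord8OffAddr# c# r2 sd of { (# se, v2 #) ->
            case writeWord8OffAddr# c# r2 (v2 `or#` int2Word# 4#) se of { sf ->
            go (plusAddr# c# 3#) sf }}}}}}
  in case go adr# s2 of s3 -> (# s3, () #) }}

main :: IO ()
main = orLoop 4096 1 2 -- contents uninitialised; only the codegen matters
}}}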
'''The problem:''' The problem is in the innermost loop of the cases, for
which case "0" the following assembly code (using -fllvm) is produced:

{{{
seGU_info$def:
# BB#0:                                 # %cgRL
        cmpq    %r14, 70(%rbx)
        jbe     .LBB35_1
        .align  16, 0x90
.LBB35_2:                               # %cgRJ
                                        # =>This Inner Loop Header: Depth=1
        movq    14(%rbx), %rcx
        movq    22(%rbx), %rdx
        movq    30(%rbx), %rsi
        movq    38(%rbx), %rdi
        movq    46(%rbx), %r8
        movq    54(%rbx), %r9
        movq    62(%rbx), %r10
        movq    6(%rbx), %rax
        addq    %r14, %rax
        orb     $1, (%r14)
        orb     $2, (%rcx,%r14)
        orb     $4, (%rdx,%r14)
        orb     $8, (%rsi,%r14)
        orb     $16, (%rdi,%r14)
        orb     $32, (%r8,%r14)
        orb     $64, (%r9,%r14)
        orb     $-128, (%r10,%r14)
        cmpq    70(%rbx), %rax
        movq    %rax, %r14
        jb      .LBB35_2
        jmp     .LBB35_3
.LBB35_1:
        movq    %r14, %rax
.LBB35_3:                               # %cgRK
        movq    (%rbp), %rcx
        movq    %rax, %rbx
        rex64 jmpq      *%rcx           # TAILCALL
}}}

One can readily see that the compiler is not lifting the loop-invariant
code, namely the initialization of the registers from the closure, to
outside the inner loop, meaning there are many register loads from memory
on every iteration which are not necessary.

'''Desired results:''' The desired assembly code is something like the
following, which is similar to what is produced by Cee (C/C++/Rust/etc.):

{{{
seGU_info$def:
# BB#0:                                 # %cgRL
        movq    14(%rbx), %rcx
        movq    22(%rbx), %rdx
        movq    30(%rbx), %rsi
        movq    38(%rbx), %rdi
        movq    46(%rbx), %r8
        movq    54(%rbx), %r9
        movq    62(%rbx), %r10
        movq    6(%rbx), %rax
        movq    70(%rbx), %rbx
        cmpq    %r14, %rbx
        jbe     .LBB35_1
        .align  16, 0x90
.LBB35_2:                               # %cgRJ
                                        # =>This Inner Loop Header: Depth=1
        orb     $1, (%r14)
        orb     $2, (%rcx,%r14)
        orb     $4, (%rdx,%r14)
        orb     $8, (%rsi,%r14)
        orb     $16, (%rdi,%r14)
        orb     $32, (%r8,%r14)
        orb     $64, (%r9,%r14)
        orb     $-128, (%r10,%r14)
        addq    %rax, %r14
        cmpq    %rbx, %rax
        jb      .LBB35_2
        jmp     .LBB35_3
.LBB35_1:
        movq    %r14, %rax
.LBB35_3:                               # %cgRK
        movq    (%rbp), %rcx
        movq    %rax, %rbx              # rbx clobbered here anyway
        rex64 jmpq      *%rcx           # TAILCALL
}}}

'''Full testing:''' The actual unrolled loop code is too long to post
here, but to verify the result is correct (23000) and the performance, the
full actual file is attached here. Due to the magic of modern CPU
instruction fusion, the code is not as slow as it would indicate by the
number of increased instructions, but while it is about twice as fast as
when culled conventionally (Intel Skylake), it is about half again as slow
as Cee. On an Intel Sky Lake i5-6500 (running at 3.5 GHz for single
threading), this takes about one second, about two seconds culled
conventionally, but only about 0.6 seconds for Rust/LLVM (with the
assembly code output essentially identical to the "desired" code).

'''Other back ends and targets:''' Although the code generated by the NCG
(native code generator) has other problems (not moving the loop test to
the end of the loop to avoid one jump, and not combining the read, modify,
and store instructions into the single available read-modify-write
instruction), it exhibits the same problem of not lifting the
loop-invariant register initialization. Although this code is x86_64, the
problem also applies to x86 code, even though the x86 architecture doesn't
have enough registers to do this in one loop and the work needs to be
split into two loops culling only four composites per loop; there is still
a significant gain in speed.
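(To reproduce the above, the attached file already carries the needed
options in its OPTIONS_GHC pragma; for example, compiling with `ghc -O2
-fllvm -keep-s-files GHC_EfficiencyBug.hs` leaves the generated `.s` file
beside the source for inspection, and `-rtsopts` allows timing runs with
`+RTS -s`.)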
Although not tested, it probably also applies to other targets such as ARM
(which has many general purpose registers).

'''Conclusion:''' The use of Addr# primitives is probably not a frequent
use case, but as shown here, when one needs their use, they should be
efficient. I considered that GHC may intentionally limit the performance
of these unsafe primitives to limit their use unless absolutely necessary
as in marshalling, something as C## does for the use of unsafe pointers,
but surely GHC would not do that as the target programmers are different.

'''If this and ticket #12798 were fixed, for this use case the GHC code
would be within a percent or two of the performance of Cee.'''

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808

Changes (by GordonBGood):

 * Attachment "GHC_EfficiencyBug.hs" added.

 Test program

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808

Description changed by GordonBGood:

@@ -288,1 +288,1 @@
-         cmpq    %rbx, %rax
+         cmpq    %rbx, %r14
@@ -298,10 +298,11 @@
- '''Full testing:''' The actual unrolled loop code is too long to post
- here, but to verify the result is correct (23000) and the performance, the
- full actual file is attached here. Due to the magic of modern CPU
- instruction fusion, the code is not as slow as it would indicate by the
- number of increased instructions, but while it is about twice as fast as
- when culled conventionally (Intel Skylake), it is about half again as slow
- as Cee. On an Intel Sky Lake i5-6500 (running at 3.5 GHz for single
- threading), this takes about one second, about two seconds culled
- conventionally, but only about 0.6 seconds for Rust/LLVM (with the
- assembly code output essentially identical to the "desired" code).
+ '''Full testing:''' The actual unrolled loop code including all the case
+ loops is too long to post here, but to verify the result is correct
+ (23000) and the performance, the full actual file is attached here. Due
+ to the magic of modern CPU instruction fusion and Out Of Order (OOE)
+ execution, the code is not as slow as it would indicate by the number of
+ increased instructions, but while it is about twice as fast as when culled
+ conventionally (Intel Skylake), it is about half again as slow as Cee. On
+ an Intel Sky Lake i5-6500 (running at 3.5 GHz for single threading), this
+ takes about one second, about two seconds culled conventionally, but only
+ about 0.6 seconds for Rust/LLVM (with the assembly code output essentially
+ identical to the "desired" code).
@@ -329,1 +330,1 @@
- marshalling, something as C## does for the use of unsafe pointers, but
+ marshalling, something as C# does for the use of unsafe pointers, but
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:1

Description changed by GordonBGood:

@@ -273,2 +273,4 @@
-         cmpq    %r14, %rbx
-         jbe     .LBB35_1
+         cmpq    %r14, %rbx            # rbx clobbered here, but old
+ value
+         jbe     .LBB35_1              # never used again until replaced
+ after loop
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:2

Comment (by carter):

So roughly: We don't have loop invariant hoisting and we should?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:3

Comment (by carter):

Or it's not firing as much as we'd like?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:4

Comment (by GordonBGood):

Yes, we don't have loop invariant hoisting (at least for Addr#) and we
should, or it's not firing as we'd like.

The C-- (Cmm) code starts like this, with the register initializations
outside the loop body as I originally wrote them:

{{{
cull_seCS_entry() //  [R2, R1]
        { info_tbl: [(cgKv,
                      label: cull_seCS_info
                      rep:HeapRep 9 nonptrs { Fun {arity: 2 fun_type: ArgSpec 4} })]
          stack_info: arg_space: 8 updfr_space: Just 8
        }
    {offset
      cgKv:
          _seCT::I64 = R2;
          _seCS::P64 = R1;
          goto cgKo;
      cgKo:
          if ((old + 0) - <highSp> < SpLim) goto cgKw; else goto cgKx;
      cgKw:
          R2 = _seCT::I64;
          R1 = _seCS::P64;
          call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
      cgKx: goto cgKn;
      cgKn:
          _seBQ::I64 = I64[_seCS::P64 + 6];
          _seCD::I64 = I64[_seCS::P64 + 14];
          _seCF::I64 = I64[_seCS::P64 + 22];
          _seCH::I64 = I64[_seCS::P64 + 30];
          _seCJ::I64 = I64[_seCS::P64 + 38];
          _seCL::I64 = I64[_seCS::P64 + 46];
          _seCN::I64 = I64[_seCS::P64 + 54];
          _seCP::I64 = I64[_seCS::P64 + 62];
          _seCQ::I64 = I64[_seCS::P64 + 70];
          _cgKq::I64 = _seCT::I64 < _seCQ::I64;
          _seCV::I64 = _cgKq::I64;
          switch [-9223372036854775808 .. 9223372036854775807] _seCV::I64 {
              case 0 : goto cgKu;
              default: goto cgKt;
          }
      cgKu: goto cgKF;
      cgKF:
          R1 = _seCT::I64;
          call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
      cgKt: goto cgKA;
      cgKA:
          _seCY::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64]);
          _seCY::I64 = _seCY::I64;
          _cgKI::I64 = _seCY::I64 | 1;
          _seCZ::I64 = _cgKI::I64;
          I8[_seCT::I64] = %MO_UU_Conv_W64_W8(_seCZ::I64);
          _seD3::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCD::I64 << 0)]);
          _seD3::I64 = _seD3::I64;
          _cgKN::I64 = _seD3::I64 | 2;
          _seD4::I64 = _cgKN::I64;
          I8[_seCT::I64 + (_seCD::I64 << 0)] = %MO_UU_Conv_W64_W8(_seD4::I64);
          _seD8::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCF::I64 << 0)]);
          _seD8::I64 = _seD8::I64;
          _cgKS::I64 = _seD8::I64 | 4;
          _seD9::I64 = _cgKS::I64;
          I8[_seCT::I64 + (_seCF::I64 << 0)] = %MO_UU_Conv_W64_W8(_seD9::I64);
          _seDd::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCH::I64 << 0)]);
          _seDd::I64 = _seDd::I64;
          _cgKX::I64 = _seDd::I64 | 8;
          _seDe::I64 = _cgKX::I64;
          I8[_seCT::I64 + (_seCH::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDe::I64);
          _seDi::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCJ::I64 << 0)]);
          _seDi::I64 = _seDi::I64;
          _cgL2::I64 = _seDi::I64 | 16;
          _seDj::I64 = _cgL2::I64;
          I8[_seCT::I64 + (_seCJ::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDj::I64);
          _seDn::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCL::I64 << 0)]);
          _seDn::I64 = _seDn::I64;
          _cgL7::I64 = _seDn::I64 | 32;
          _seDo::I64 = _cgL7::I64;
          I8[_seCT::I64 + (_seCL::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDo::I64);
          _seDs::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCN::I64 << 0)]);
          _seDs::I64 = _seDs::I64;
          _cgLc::I64 = _seDs::I64 | 64;
          _seDt::I64 = _cgLc::I64;
          I8[_seCT::I64 + (_seCN::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDt::I64);
          _seDx::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCP::I64 << 0)]);
          _seDx::I64 = _seDx::I64;
          _cgLh::I64 = _seDx::I64 | 128;
          _seDy::I64 = _cgLh::I64;
          I8[_seCT::I64 + (_seCP::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDy::I64);
          _cgLm::I64 = _seCT::I64 + _seBQ::I64;
          _seDA::I64 = _cgLm::I64;
          _seCT::I64 = _seDA::I64;
          goto cgKn;
    }
},
}}}
After many steps of optimization, it ends up with the initializations
moved inside the loop, as follows:

{{{
cull_seCS_entry() //  [R1, R2]
        { [(cgKv,
            cull_seCS_info:
                const 8589934596;
                const 38654705664;
                const 9;)]
        }
    {offset
      cgKv:
          _seCT::I64 = R2;
          _seCS::P64 = R1;
          goto cgKn;
      cgKn:
          switch [-9223372036854775808 .. 9223372036854775807]
                 (_seCT::I64 < I64[_seCS::P64 + 70]) {
              case 0 : goto cgKu;
              default: goto cgKt;
          }
      cgKu:
          R1 = _seCT::I64;
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
      cgKt:
          _seBQ::I64 = I64[_seCS::P64 + 6];
          _seCD::I64 = I64[_seCS::P64 + 14];
          _seCF::I64 = I64[_seCS::P64 + 22];
          _seCH::I64 = I64[_seCS::P64 + 30];
          _seCJ::I64 = I64[_seCS::P64 + 38];
          _seCL::I64 = I64[_seCS::P64 + 46];
          _seCN::I64 = I64[_seCS::P64 + 54];
          _seCP::I64 = I64[_seCS::P64 + 62];
          I8[_seCT::I64] =
              %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64]) | 1);
          I8[_seCT::I64 + _seCD::I64] =
              %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCD::I64]) | 2);
          I8[_seCT::I64 + _seCF::I64] =
              %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCF::I64]) | 4);
          I8[_seCT::I64 + _seCH::I64] =
              %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCH::I64]) | 8);
          I8[_seCT::I64 + _seCJ::I64] =
              %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCJ::I64]) | 16);
          I8[_seCT::I64 + _seCL::I64] =
              %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCL::I64]) | 32);
          I8[_seCT::I64 + _seCN::I64] =
              %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCN::I64]) | 64);
          I8[_seCT::I64 + _seCP::I64] =
              %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCP::I64]) | 128);
          _seCT::I64 = _seCT::I64 + _seBQ::I64;
          goto cgKn;
    }
}
}}}

The movement of the register initialization to inside the loops seems to
happen at a very early stage (as when it is recognized that these are
pointer/addr# operations) and never gets fixed...

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:5

Comment (by simonpj):

As I understand it, these movements could not sensibly be hoisted at the
Core level, or could they?

I'm failing to see how the code at the top lines up with the Cmm you are
showing. Maybe show STG code too, and say how they match up?

If we can do the floating in Core, that would be better! If the
opportunity only gets exposed when we are in Cmm, I wonder if it's worth
our doing this in Cmm, or whether it's best left to LLVM?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:6

Comment (by GordonBGood):

Replying to [comment:6 simonpj]:
> I'm failing to see how the code at the top lines up with the Cmm you are showing.
@simonpj, the Cmm code shown is the first of the "cull" case loops from the Haskell GHC code; the bottom "optimized" version has had the register initializations dropped down inside the loop.
> Maybe show STG code too, and say how they match up? If we can do the floating in Core, that would be better!
Here is the STG code for the same cull loop/recursive function (output of -ddump-stg from GHC version 8.0.1 on 64-bit Windows, lines 898 through 1132 of the dump; reformatted here for readability):
{{{
let {
  cull_seCS [Occ=LoopBreaker]
    :: GHC.Prim.Addr#
       -> GHC.Prim.State# GHC.Prim.RealWorld
       -> (# GHC.Prim.Addr#, GHC.Prim.State# GHC.Prim.RealWorld #)
  [LclId, Arity=2, Str=DmdType, Unf=OtherCon []] =
      sat-only \r srt:SRT:[] [c#_seCT sp#_seCU]
          case ltAddr# [c#_seCT lmt1#_seCQ] of _ [Occ=Dead] {
            __DEFAULT ->
              case readWord8OffAddr# [c#_seCT 0# sp#_seCU] of
              _ [Occ=Dead] { (#,#) ipv8_seCX [Occ=Once] ipv9_seCY [Occ=Once] ->
              case or# [ipv9_seCY 1##] of sat_seCZ { __DEFAULT ->
              case writeWord8OffAddr# [c#_seCT 0# sat_seCZ ipv8_seCX] of
              sp1#_seD0 [OS=OneShot] { __DEFAULT ->
              case readWord8OffAddr# [c#_seCT r1#_seCD sp1#_seD0] of
              _ [Occ=Dead] { (#,#) ipv10_seD2 [Occ=Once] ipv11_seD3 [Occ=Once] ->
              case or# [ipv11_seD3 2##] of sat_seD4 { __DEFAULT ->
              case writeWord8OffAddr# [c#_seCT r1#_seCD sat_seD4 ipv10_seD2] of
              sp3#_seD5 [OS=OneShot] { __DEFAULT ->
              case readWord8OffAddr# [c#_seCT r2#_seCF sp3#_seD5] of
              _ [Occ=Dead] { (#,#) ipv12_seD7 [Occ=Once] ipv13_seD8 [Occ=Once] ->
              case or# [ipv13_seD8 4##] of sat_seD9 { __DEFAULT ->
              case writeWord8OffAddr# [c#_seCT r2#_seCF sat_seD9 ipv12_seD7] of
              sp5#_seDa [OS=OneShot] { __DEFAULT ->
              case readWord8OffAddr# [c#_seCT r3#_seCH sp5#_seDa] of
              _ [Occ=Dead] { (#,#) ipv14_seDc [Occ=Once] ipv15_seDd [Occ=Once] ->
              case or# [ipv15_seDd 8##] of sat_seDe { __DEFAULT ->
              case writeWord8OffAddr# [c#_seCT r3#_seCH sat_seDe ipv14_seDc] of
              sp7#_seDf [OS=OneShot] { __DEFAULT ->
              case readWord8OffAddr# [c#_seCT r4#_seCJ sp7#_seDf] of
              _ [Occ=Dead] { (#,#) ipv16_seDh [Occ=Once] ipv17_seDi [Occ=Once] ->
              case or# [ipv17_seDi 16##] of sat_seDj { __DEFAULT ->
              case writeWord8OffAddr# [c#_seCT r4#_seCJ sat_seDj ipv16_seDh] of
              sp9#_seDk [OS=OneShot] { __DEFAULT ->
              case readWord8OffAddr# [c#_seCT r5#_seCL sp9#_seDk] of
              _ [Occ=Dead] { (#,#) ipv18_seDm [Occ=Once] ipv19_seDn [Occ=Once] ->
              case or# [ipv19_seDn 32##] of sat_seDo { __DEFAULT ->
              case writeWord8OffAddr# [c#_seCT r5#_seCL sat_seDo ipv18_seDm] of
              sp11#_seDp [OS=OneShot] { __DEFAULT ->
              case readWord8OffAddr# [c#_seCT r6#_seCN sp11#_seDp] of
              _ [Occ=Dead] { (#,#) ipv20_seDr [Occ=Once] ipv21_seDs [Occ=Once] ->
              case or# [ipv21_seDs 64##] of sat_seDt { __DEFAULT ->
              case writeWord8OffAddr# [c#_seCT r6#_seCN sat_seDt ipv20_seDr] of
              sp13#_seDu [OS=OneShot] { __DEFAULT ->
              case readWord8OffAddr# [c#_seCT r7#_seCP sp13#_seDu] of
              _ [Occ=Dead] { (#,#) ipv22_seDw [Occ=Once] ipv23_seDx [Occ=Once] ->
              case or# [ipv23_seDx 128##] of sat_seDy { __DEFAULT ->
              case writeWord8OffAddr# [c#_seCT r7#_seCP sat_seDy ipv22_seDw] of
              sp15#_seDz [OS=OneShot] { __DEFAULT ->
              case plusAddr# [c#_seCT p#_seBQ] of sat_seDA { __DEFAULT ->
              cull_seCS sat_seDA sp15#_seDz;
              };};};};};};};};};};};};};};};};};};};};};};};};};
            0# -> (#,#) [c#_seCT sp#_seCU];
          };
} in
}}}
You can see that the STG code just reflects the original Haskell source
code and that the faulty register initialization has not yet been dropped
down to within the loop(s), so the problem is not here. The problem is in
the Cmm optimization pass, and thus it also applies to the NCG (although
of course the NCG also has other problems).
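(The stages in between can be inspected further with flags such as
`-ddump-cmm`, alongside the `-ddump-stg` output above, to help pin down
exactly which Cmm pass sinks the initializations.)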
The easiest way to fix this might be to turn on the appropriate LLVM loop
invariant code flow optimizations (if they would work) and have the fix
apply only to LLVM.
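(One untested possibility: GHC's `-optlo` flag passes options through to
LLVM's `opt` tool, so something like `-optlo-O2`, or explicitly requesting
LLVM's loop-invariant code motion pass with `-optlo-licm`, might show
whether LLVM can clean this up without GHC changes.)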
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:7

Description changed by GordonBGood:

@@ -9,1 +9,1 @@
- speed even further as per the method described for "primesieve as per
+ speed even further as per the method described for "primesieve" as per
twos :: UArray Int Word32 twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]] soep1 :: () -> [Word32] soep1() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where bufb = runSTUArray $ do let bfBts = (256 * 1024) `div` 2 -- to 2^18 + 2 is 128 KBits = 16 KBytes bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool) cullb bf cullb bf@(STUArray l u n marr#) = ST $ \s0# -> case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) -> let loop t mr# s0# = -- cull a number of times to test timing if t <= 0 then (# s0#, STUArray l u n mr# #) else case getSizeofMutableByteArray# mr# s0# of { (# s1#, n# #) -> case newPinnedByteArray# n# s1# of { (# s2#, marr'# #) -> case copyMutableByteArray# mr# 0# marr'# 0# n# s2# of { s3# -> case unsafeFreezeByteArray# marr'# s3# of { (# s4#, arr# #) -> -- must do this case byteArrayContents# arr# of { adr# -> -- to obtain the addr# here let cullp i@(I# i#) sp# = let !p@(I# p#) = i + i + 3 in let !s@(I# s#) = (p * p - 3) `div` 2 in if s >= n then case copyMutableByteArray# marr'# 0# mr# 0# n# sp# of so# -> (# so#, mr# #) else let !(UArray _ _ _ tarr#) = twos in case readWord64Array# marr# (i# `uncheckedIShiftRL#` 6#) sp# of { (# sp0#, v0# #) -> case (v0# `and#` ((int2Word# 1#) `uncheckedShiftL#` (i# `andI#` 63#))) `eqWord#` (int2Word# 0#) of 0# -> cullp (i + 1) sp0# -- not prime _ -> -- is prime -- most program execution time spent in the following tight loops. -- the following code implments extream loop unrolling... let !pi@(I# pi#) = fromIntegral p in let !sw@(I# sw#) = s `shiftR` 3 in let !sb@(I# sb#) = s .&. 7 in let p1 = sb + pi in let !(I# r1#) = p1 `shiftR` 3 in let p2 = p1 + pi in let !(I# r2#) = p2 `shiftR` 3 in let p3 = p2 + pi in let !(I# r3#) = p3 `shiftR` 3 in let p4 = p3 + pi in let !(I# r4#) = p4 `shiftR` 3 in let p5 = p4 + pi in let !(I# r5#) = p5 `shiftR` 3 in let p6 = p5 + pi in let !(I# r6#) = p6 `shiftR` 3 in let p7 = p6 + pi in let !(I# r7#) = p7 `shiftR` 3 in let !lmt@(I# lmt#) = (fromIntegral n `shiftR` 3) - p7 in let !lmt1# = plusAddr# adr# lmt# in let !strt# = plusAddr# adr# sw# in let !(I# n#) = n in let (# !so#, !sco# #) = case ((((p - 1) `div` 2) .&. 
3) `shiftL` 3) + sb of { 0 -> let cull c# sp# = case c# `ltAddr#` lmt1# of 0# -> (# c#, sp# #) _ -> case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) -> case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 1#)) sp0# of { sp1# -> case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) -> case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 2#)) sp2# of { sp3# -> case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) -> case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 4#)) sp4# of { sp5# -> case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) -> case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 8#)) sp6# of { sp7# -> case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) -> case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 16#)) sp8# of { sp9# -> case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) -> case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 32#)) sp10# of { sp11# -> case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) -> case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 64#)) sp12# of { sp13# -> case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) -> case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 128#)) sp14# of { sp15# -> cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}} in cull strt# sp0#; 1 -> let cull c# sp# = case c# `ltAddr#` lmt1# of 0# -> (# c#, sp# #) _ -> case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) -> case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 2#)) sp0# of { sp1# -> case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) -> case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 4#)) sp2# of { sp3# -> case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) -> case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 8#)) sp4# of { sp5# -> case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) -> case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 16#)) sp6# of { sp7# -> case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) -> case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 32#)) sp8# of { sp9# -> case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) -> case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 64#)) sp10# of { sp11# -> case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) -> case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 128#)) sp12# of { sp13# -> case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) -> case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 1#)) sp14# of { sp15# -> cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}} in cull strt# sp0#; -- and so on for 30 more cases... _ -> (# strt#, sp0# #) {- normally never taken case, all cases covered -} } in let !ns# = ((minusAddr# so# adr#) `uncheckedIShiftL#` 3#) +# sb# in -- extreme loop unrolling ends here; remaining primes culled conventionally... let cull j# sc# = -- very tight inner loop case j# <# n# of 0# -> cullp (i + 1) sc# _ -> let i# = j# `andI#` 31# in let !sh# = indexWord32Array# tarr# i# in -- (1 `shiftL` (j .&. 
31))) let w# = j# `uncheckedIShiftRL#` 5# in case readWord32Array# marr'# w# sc# of { (# sc0#, ov# #) -> case writeWord32Array# marr'# w# (ov# `or#` sh#) sc0# of { sc1# -> cull (j# +# pi#) sc1# }} in cull ns# sp0# } in case cullp 0 s4# of (# sp#, mrp# #) -> loop (t - 1) mrp# sp# }}}}} in loop numLOOPS marr# s1# } main = print $ length $ soep1() }}} '''The problem:''' The problem is in the innermost loop of the cases, for which case "0" the following assembly code (using -fllvm) is produced: {{{ seGU_info$def: # BB#0: # %cgRL cmpq %r14, 70(%rbx) jbe .LBB35_1 .align 16, 0x90 .LBB35_2: # %cgRJ # =>This Inner Loop Header: Depth=1 movq 14(%rbx), %rcx movq 22(%rbx), %rdx movq 30(%rbx), %rsi movq 38(%rbx), %rdi movq 46(%rbx), %r8 movq 54(%rbx), %r9 movq 62(%rbx), %r10 movq 6(%rbx), %rax addq %r14, %rax orb $1, (%r14) orb $2, (%rcx,%r14) orb $4, (%rdx,%r14) orb $8, (%rsi,%r14) orb $16, (%rdi,%r14) orb $32, (%r8,%r14) orb $64, (%r9,%r14) orb $-128, (%r10,%r14) cmpq 70(%rbx), %rax movq %rax, %r14 jb .LBB35_2 jmp .LBB35_3 .LBB35_1: movq %r14, %rax .LBB35_3: # %cgRK movq (%rbp), %rcx movq %rax, %rbx rex64 jmpq *%rcx # TAILCALL }}} One can readily see that the compiler is not lifting the Loop Invariant Code Flow as in initializing the registers to outside the inner loop, meaning there are many register loads from memory which are not necessary. '''Desired results:''' The desired assembly code is something like the following, which is similar to what is produced by Cee (C/C++/Rust/etc.): {{{ seGU_info$def: # BB#0: # %cgRL movq 14(%rbx), %rcx movq 22(%rbx), %rdx movq 30(%rbx), %rsi movq 38(%rbx), %rdi movq 46(%rbx), %r8 movq 54(%rbx), %r9 movq 62(%rbx), %r10 movq 6(%rbx), %rax movq 70(%rbx), %rbx cmpq %r14, %rbx # rbx clobbered here, but old value jbe .LBB35_1 # never used again until replaced after loop .align 16, 0x90 .LBB35_2: # %cgRJ # =>This Inner Loop Header: Depth=1 orb $1, (%r14) orb $2, (%rcx,%r14) orb $4, (%rdx,%r14) orb $8, (%rsi,%r14) orb $16, (%rdi,%r14) orb $32, (%r8,%r14) orb $64, (%r9,%r14) orb $-128, (%r10,%r14) addq %rax, %r14 cmpq %rbx, %r14 jb .LBB35_2 jmp .LBB35_3 .LBB35_1: movq %r14, %rax .LBB35_3: # %cgRK movq (%rbp), %rcx movq %rax, %rbx # rbx clobbered here anyway rex64 jmpq *%rcx # TAILCALL }}} '''Full testing:''' The actual unrolled loop code including all the case loops is too long to post here, but to verify the result is correct (23000) and the performance, the full actual file is attached here. Due to the magic of modern CPU instruction fusion and Out Of Order (OOE) execution, the code is not as slow as it would indicate by the number of increased instructions, but while it is about twice as fast as when culled conventionally (Intel Skylake), it is about half again as slow as Cee. On an Intel Sky Lake i5-6500 (running at 3.5 GHz for single threading), this takes about one second, about two seconds culled conventionally, but only about 0.6 seconds for Rust/LLVM (with the assembly code output essentially identical to the "desired" code). '''Other back ends and targets:''' Although the code generated by the native NCG has other problems (not moving the loop test to the end of the loop to avoid one jump, and not combining the read and modify and store instructions into the single available instruction), it exhibits the same problem as to not lifting the Loop Invariant Code Flow register initialization. 
Although this code is x86_64, the problem also applies to x86 code even though the x86 architecture doesn't have enough registers to do this in one loop and needs to be split into two loops culling only four composites per loop, but there still is a significant gain in speed. Although not tested, it probably also applies to other targets such as ARM (which has many general purpose registers). '''Conclusion:''' The use of Addr# primitives is probably not a frequent use case, but as shown here that when one needs their use, they should be efficient. I considered that GHC may intentionally limit the performance of these unsafe primitives to limit their use unless absolutely necessary as in marshalling, something as C# does for the use of unsafe pointers, but surely GHC would not do that as the target programmers are different. '''If this and ticket #12798 were fixed, for this use case the GHC code would be within a percent or two of the performance of Cee.''' -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#12808: For primitive (Addr#) operations, Loop Invariant Code Flow not lifted outside the loop... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):

> You can see that the STG code just reflects the original Haskell source code and that the faulty register initialization has not yet been dropped down to within the loop(s), so the problem is not here. The problem is in the generation of the first CMM.

Aha! Could you possibly make the tiniest possible example that illustrates precisely this point. You can motivate its importance by this thread, but in thinking about how to fix it, it's MUCH easier to grok a small example.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:9
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#12808: For primitive (Addr#) operations, Loop Invariant Code Flow not lifted outside the loop... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by GordonBGood):

Replying to [comment:9 simonpj]:

> Aha! Could you possibly make the tiniest possible example that illustrates precisely this point. You can motivate its importance by this thread, but in thinking about how to fix it, it's MUCH easier to grok a small example.
I can't cut the test program down to just a few lines, as I believe the
problem is related to pointers and pointer arithmetic (the Addr#
primitive), and thus some setup is required to use them in a loop that
exhibits the problem.
However, I have boiled the test down to a very simple tail-recursive loop
with only one cull operation per loop, using an Addr# and an offset, that
shows the problem; this loop is inside another loop that feeds a variable
prime "p" to the inner loop so it doesn't get optimized away as a constant,
and this in turn is inside the setup code that produces the pinned byte
array on which the loop works, in the following code:
{{{#!hs
-- SimpleEfficiencyBug
{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O3 -rtsopts -keep-s-files -ddump-stg -ddump-cmm -ddump-opt-cmm -ddump-to-file -dumpdir . #-} -- or -O2 -fllvm -v -dcore-lint -ddump-asm

import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )
import GHC.Exts

cull :: () -> [Int]
cull() = [i | (i, True) <- assocs arr ] where
  arr = runSTUArray $ do
    let bfBts = 1 `shiftL` 17 -- 16 Kilobytes worth of bits
    bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool)
    cullb bf
  cullb (STUArray l u n marr#) = ST $ \s0# -> -- following is just setup for the loop...
    case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) ->
    case newPinnedByteArray# n# s1# of { (# s2#, marr'# #) ->
    case copyMutableByteArray# marr# 0# marr'# 0# n# s2# of { s3# ->
    case unsafeFreezeByteArray# marr'# s3# of { (# s4#, arr# #) -> -- must do this
    case byteArrayContents# arr# of { adr# -> -- to obtain the addr# of pinned marr' here
    let cullp !p@(I# p#) sp# = -- for several prime values
          if p > 5 then case copyMutableByteArray# marr'# 0# marr# 0# n# sp# of
                          so# -> (# so#, STUArray l u n marr# #) else
          let !r1@(I# r1#) = ((p .&. 7) + p) `shiftR` 3 in -- register offset value
          let !(I# szlmt#) = n `div` 8 - r1 in
          let lmt# = plusAddr# adr# szlmt# in
          let doit c# s# = -- all the work is done here; herein lies the bugs...
                case c# `ltAddr#` lmt# of
                  0# -> s#
                  _ ->
                    case readWord8OffAddr# c# r1# s# of { (# s0#, v0# #) ->
                    case writeWord8OffAddr# c# r1# (v0# `or#` (int2Word# 1#)) s0# of { s1# ->
                    doit (plusAddr# c# p#) s1# }} in
          case doit adr# sp# of sd# -> cullp (p + 2) sd# in
    cullp 1 s4# }}}}}
main = print $ length $ cull()
}}}
When compiled with the "-fllvm" compiler flag, the above code produces the
following STG code for the inner loop (located by searching for the first
"doit1_"):
{{{
let {
doit1_s7pL [Occ=LoopBreaker]
:: GHC.Prim.Addr#
-> GHC.Prim.State#
GHC.Prim.RealWorld
-> GHC.Prim.State#
GHC.Prim.RealWorld
[LclId,
Arity=2,
Str=DmdType ,
Unf=OtherCon []] =
sat-only \r srt:SRT:[] [c#_s7pM
s#_s7pN]
case
ltAddr# [c#_s7pM
lmt#1_s7pJ]
of
_ [Occ=Dead]
{ __DEFAULT ->
case
readWord8OffAddr# [c#_s7pM
r1#_s7pG
s#_s7pN]
of
_ [Occ=Dead]
{ (#,#) ipv6_s7pQ [Occ=Once]
ipv7_s7pR [Occ=Once] ->
case
or# [ipv7_s7pR
1##]
of
sat_s7pS
{ __DEFAULT ->
case
writeWord8OffAddr# [c#_s7pM
r1#_s7pG
sat_s7pS
ipv6_s7pQ]
of
s1#1_s7pT [OS=OneShot]
{ __DEFAULT ->
case
plusAddr# [c#_s7pM
ww_s7pC]
of
sat_s7pU
{ __DEFAULT ->
doit1_s7pL
sat_s7pU
s1#1_s7pT;
};
};
};
};
0# ->
s#_s7pN;
};
} in
}}}
This first produces the following CMM code (found by searching for "doit1_"):
{{{
 doit1_s7pL_entry() // [R2, R1]
         { info_tbl: [(c7Kw,
                       label: doit1_s7pL_info
                       rep:HeapRep 3 nonptrs { Fun {arity: 2 fun_type: ArgSpec 4} })]
           stack_info: arg_space: 8 updfr_space: Just 8
         }
     {offset
       c7Kw:
           _s7pM::I64 = R2;
           _s7pL::P64 = R1;
           goto c7Kp;
       c7Kp:
           if ((old + 0) - <highSp> < SpLim) goto c7Kx; else goto c7Ky;
       c7Kx:
           R2 = _s7pM::I64;
           R1 = _s7pL::P64;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       c7Ky:
           goto c7Ko;
       c7Ko:
           _s7pC::I64 = I64[_s7pL::P64 + 6];   // registers initialized inside loop here
           _s7pG::I64 = I64[_s7pL::P64 + 14];
           _s7pJ::I64 = I64[_s7pL::P64 + 22];  // to here
           _c7Kr::I64 = _s7pM::I64 < _s7pJ::I64;
           _s7pO::I64 = _c7Kr::I64;
           switch [-9223372036854775808 .. 9223372036854775807] _s7pO::I64 {
               case 0 : goto c7Kv;
               default: goto c7Ku;
           }
       c7Kv:
           goto c7KG;
       c7KG:
           call (P64[(old + 8)])() args: 8, res: 0, upd: 8;
       c7Ku:
           goto c7KB;
       c7KB:
           _s7pR::I64 = %MO_UU_Conv_W8_W64(I8[_s7pM::I64 + (_s7pG::I64 << 0)]);
           _s7pR::I64 = _s7pR::I64;
           _c7KJ::I64 = _s7pR::I64 | 1;
           _s7pS::I64 = _c7KJ::I64;
           I8[_s7pM::I64 + (_s7pG::I64 << 0)] = %MO_UU_Conv_W64_W8(_s7pS::I64);
           _c7KO::I64 = _s7pM::I64 + _s7pC::I64;
           _s7pU::I64 = _c7KO::I64;
           _s7pM::I64 = _s7pU::I64;
           goto c7Ko;
     }
 },
}}}
then after many optimization passes produces the following optimized CMM
code:
{{{
==================== Optimised Cmm ====================
2016-11-11 13:14:21.2389114 UTC
 doit1_s7pL_entry() // [R1, R2]
         { [(c7Kw,
             doit1_s7pL_info:
                 const 8589934596;
                 const 12884901888;
                 const 9;)]
         }
     {offset
       c7Kw:
           _s7pM::I64 = R2;
           _s7pL::P64 = R1;
           goto c7Ko;
       c7Ko:
           switch [-9223372036854775808 .. 9223372036854775807] (_s7pM::I64 < I64[_s7pL::P64 + 22]) {
               case 0 : goto c7Kv;
               default: goto c7Ku;
           }
       c7Kv:
           call (P64[Sp])() args: 8, res: 0, upd: 8;
       c7Ku:
           _s7pC::I64 = I64[_s7pL::P64 + 6];   // registers initialized inside loop here
           _s7pG::I64 = I64[_s7pL::P64 + 14];  // and here
           I8[_s7pM::I64 + _s7pG::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_s7pM::I64 + _s7pG::I64]) | 1);
           _s7pM::I64 = _s7pM::I64 + _s7pC::I64;
           goto c7Ko;
     }
 }
}}}
and finally the following assembly code:
{{{
s7pL_info$def:
# BB#0:                               # %c7Kw
        cmpq    %r14, 22(%rbx)
        jbe     .LBB18_2
        .align  16, 0x90
.LBB18_1:                             # %c7Ku
                                      # =>This Inner Loop Header: Depth=1
        movq    14(%rbx), %rax        # registers initialized inside loop here
        movq    6(%rbx), %rcx         # and here
        addq    %r14, %rcx
        orb     $1, (%rax,%r14)
        cmpq    22(%rbx), %rcx        # and an additional unnecessary memory load here by LLVM?
        movq    %rcx, %r14            # extra unnecessary instruction if code reformulated
        jb      .LBB18_1
.LBB18_2:                             # %c7Kv
        movq    (%rbp), %rax
        rex64 jmpq      *%rax         # TAILCALL
}}}
I find no problems with the STG code, but the problems persist through all
of the later representations, starting with the initial CMM code. I have
commented on where the problems are in the listings above. I would like to
see the optimized CMM code look like the following:
{{{
 doit1_s7pL_entry() // [R1, R2]
         { [(c7Kw,
             doit1_s7pL_info:
                 const 8589934596;
                 const 12884901888;
                 const 9;)]
         }
     {offset
       c7Kw:
           _s7pM::I64 = R2;
           _s7pL::P64 = R1;
           _s7pC::I64 = I64[_s7pL::P64 + 6];   // registers initialized outside loop here
           _s7pG::I64 = I64[_s7pL::P64 + 14];  // and here
           goto c7Ko;
       c7Ko:
           switch [-9223372036854775808 .. 9223372036854775807] (_s7pM::I64 < I64[_s7pL::P64 + 22]) {
               case 0 : goto c7Kv;
               default: goto c7Ku;
           }
       c7Kv:
           call (P64[Sp])() args: 8, res: 0, upd: 8;
       c7Ku:
           I8[_s7pM::I64 + _s7pG::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_s7pM::I64 + _s7pG::I64]) | 1);
           _s7pM::I64 = _s7pM::I64 + _s7pC::I64;
           goto c7Ko;
     }
 }
}}}
which should produce the following desired assembly code:
{{{
s7pL_info$def:
# BB#0:                               # %c7Kw
        movq    14(%rbx), %rax        # registers initialized outside loop here
        movq    6(%rbx), %rcx         # and here
        movq    22(%rbx), %rbx        # and here
        cmpq    %r14, %rbx
        jbe     .LBB18_2
        .align  16, 0x90
.LBB18_1:                             # %c7Ku
                                      # =>This Inner Loop Header: Depth=1
        orb     $1, (%rax,%r14)
        addq    %rcx, %r14
        cmpq    %rbx, %r14            # use a register for the comparison
        jb      .LBB18_1
.LBB18_2:                             # %c7Kv
        movq    (%rbp), %rax
        rex64 jmpq      *%rax         # TAILCALL
}}}
The above assembly code is about as good as it gets in any language, and
GHC should be able to produce this, at least with the LLVM backend.
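As an aside, the same read/OR/write-at-an-offset culling pattern can be
reproduced without raw Addr# through the portable Foreign.Ptr API. The
following is a minimal, self-contained sketch (illustrative only, not part
of this ticket's attached code; `cullFrom` and its parameters are
hypothetical names) that can serve as a starting point for experimenting
with the same dumps:
{{{#!hs
-- Hypothetical mini-harness: OR bit 0 into every p-th byte, reading and
-- writing at a fixed offset r1 from the moving index, the same shape as
-- the `doit` loop above, but via the portable Foreign.Ptr API.
import Data.Bits ((.|.))
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Utils (fillBytes)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peekByteOff, pokeByteOff)

cullFrom :: Ptr Word8 -> Int -> Int -> Int -> IO ()
cullFrom base r1 p lmt = go 0 where
  go i
    | i >= lmt  = return ()
    | otherwise = do
        v <- peekByteOff (base `plusPtr` i) r1 :: IO Word8
        pokeByteOff (base `plusPtr` i) r1 (v .|. 1)
        go (i + p)                     -- step by the "prime" stride

main :: IO ()
main = allocaBytes 2048 $ \buf -> do
  fillBytes buf 0 2048                 -- zero the buffer first
  cullFrom buf 3 7 2040                -- offset 3, stride 7, stays in bounds
  v <- peekByteOff buf 3 :: IO Word8
  print v                              -- expect 1: bit 0 set at first cull
}}}
The same -ddump-opt-cmm / -ddump-asm flags used above can then show whether
the offset and stride loads get hoisted out of `go`.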
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:10
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code
Flow not lifted outside the loop...
-------------------------------------+-------------------------------------
Reporter: GordonBGood | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by GordonBGood):
It seems that the loop invariant code flow not being lifted out of the
loops is not limited to primitive operations (including Addr#), but is the
general case for any code that is not purely strict; thus anything
involving boxed thunks does not seem to be optimized properly.
The following code, a simple naive Sieve of Eratosthenes implementation
with the composite number culling operations run a number of times in a
loop for better timing purposes, demonstrates the problem:
{{{
{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O3 -rtsopts -keep-s-files -ddump-stg -ddump-cmm -ddump-opt-cmm -ddump-to-file -dumpdir . #-} -- or -O2 -fllvm -v -dcore-lint -ddump-asm

import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )

twos = listArray (0, 31) [ 1 `shiftL` i | i <- [0 .. 31]] :: UArray Int Word32

eos :: Int -> [Int]
eos top = [fromIntegral i | (i, False) <- assocs cmpsts] where
  cmpsts = runSTUArray $ do
    cmpstsb <- newArray (0, top) False :: ST s (STUArray s Int Bool)
    cmpstsw <- (castSTUArray :: STUArray s Int Bool -> ST s (STUArray s Int Word32)) cmpstsb
    unsafeWrite cmpstsw 0 3 -- precull 0 and 1
    let loop i =
          if i <= 0 then return cmpstsb else
          let nxtp p =
                let s = p * p in
                if s > top then loop (i - 1) else do
                  v <- unsafeRead cmpstsw (p `shiftR` 5)
                  if v .&. unsafeAt twos (p .&. 31) /= 0 then nxtp (p + 1) else
                    let nxtc c =
                          if c > top then return () else do
                            let w = c `shiftR` 5
                            v <- unsafeRead cmpstsw w
                            unsafeWrite cmpstsw w (v .|. unsafeAt twos (c .&. 31))
                            nxtc (c + p) in do { nxtc s; nxtp (p + 1) } in
          twos `seq` nxtp 2
    loop (10000 :: Int)
main = print $ length $ eos(131071)
}}}
When compiled with the -fllvm (LLVM back end) compiler flag, it produces
the following STG code for the inner loop (located by searching for
"nxtc"; massively indented for display):
{{{
let {
$wnxtc_s7Ru [InlPrag=[0],
Occ=LoopBreaker]
:: GHC.Prim.Int#
-> GHC.Prim.State#
GHC.Prim.RealWorld
-> (# GHC.Prim.State#
GHC.Prim.RealWorld,
() #)
[LclId,
Arity=2,
Str=DmdType ,
Unf=OtherCon []] =
sat-only \r srt:SRT:[] [ww1_s7Rv
w1_s7Rw]
case
># [ww1_s7Rv
131071#]
of
sat_s7Rx
{ __DEFAULT ->
case
tagToEnum# [sat_s7Rx]
of
_ [Occ=Dead]
{ GHC.Types.False ->
case
uncheckedIShiftRA# [ww1_s7Rv
5#]
of
i#_s7Rz [Dmd=]
{ __DEFAULT ->
case
readWord32Array# [ipv1_s7R1
i#_s7Rz
w1_s7Rw]
of
_ [Occ=Dead]
{ (#,#) ipv8_s7RB [Occ=Once]
ipv9_s7RC [Occ=Once] ->
case
andI# [ww1_s7Rv
31#]
of
sat_s7RD
{ __DEFAULT ->
case
indexWord32Array# [ipv5_s7Rf
sat_s7RD]
of
wild5_s7RE
{ __DEFAULT ->
case
or# [ipv9_s7RC
wild5_s7RE]
of
sat_s7RF
{ __DEFAULT ->
case
writeWord32Array#
[ipv1_s7R1
i#_s7Rz
sat_s7RF
ipv8_s7RB]
of
s2#1_s7RG [OS=OneShot]
{ __DEFAULT ->
case
+# [ww1_s7Rv
ww_s7Rh]
of
sat_s7RH
{ __DEFAULT ->
$wnxtc_s7Ru
sat_s7RH
s2#1_s7RG;
};
};
};
};
};
};
};
GHC.Types.True ->
(#,#) [w1_s7Rw
GHC.Tuple.()];
};
};
} in
}}}
This, in turn, produces the following CMM code:
{{{
       c8oB:
           _s7R1::P64 = P64[_s7Ru::P64 + 6];
           _s7Rf::P64 = P64[_s7Ru::P64 + 14];
           _s7Rh::I64 = I64[_s7Ru::P64 + 22];
           _c8oE::I64 = %MO_S_Gt_W64(_s7Rv::I64, 131071);
           _s7Rx::I64 = _c8oE::I64;
           switch [0 .. 1] _s7Rx::I64 {
               case 0 : goto c8oM;
               case 1 : goto c8oN;
           }
       c8oN:
           R1 = GHC.Tuple.()_closure+1;
           call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
       c8oM:
           _c8oP::I64 = %MO_S_Shr_W64(_s7Rv::I64, 5);
           _s7Rz::I64 = _c8oP::I64;
           _s7RC::I64 = %MO_UU_Conv_W32_W64(I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)]);
           _s7RC::I64 = _s7RC::I64;
           _c8oS::I64 = _s7Rv::I64 & 31;
           _s7RD::I64 = _c8oS::I64;
           _c8oV::I64 = %MO_UU_Conv_W32_W64(I32[(_s7Rf::P64 + 16) + (_s7RD::I64 << 2)]);
           _s7RE::I64 = _c8oV::I64;
           _c8oY::I64 = _s7RC::I64 | _s7RE::I64;
           _s7RF::I64 = _c8oY::I64;
           I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)] = %MO_UU_Conv_W64_W32(_s7RF::I64);
           _c8p3::I64 = _s7Rv::I64 + _s7Rh::I64;
           _s7RH::I64 = _c8p3::I64;
           _s7Rv::I64 = _s7RH::I64;
           goto c8oB;
}}}
which is reduced to the following CMM code after many optimization passes:
{{{
       c8oB:
           switch [0 .. 1] (%MO_S_Gt_W64(_s7Rv::I64, 131071)) {
               case 0 : goto c8oM;
               case 1 : goto c8oN;
           }
       c8oN:
           R1 = GHC.Tuple.()_closure+1;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c8oM:
           _s7R1::P64 = P64[_s7Ru::P64 + 6];
           _s7Rh::I64 = I64[_s7Ru::P64 + 22];
           _s7Rz::I64 = %MO_S_Shr_W64(_s7Rv::I64, 5);
           I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)] = %MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[P64[_s7Ru::P64 + 14] + ((_s7Rv::I64 & 31 << 2) + 16)]));
           _s7Rv::I64 = _s7Rv::I64 + _s7Rh::I64;
           goto c8oB;
}}}
and finally the following assembly code:
{{{
        .align  16, 0x90
.LBB29_1:                             # %c8oM
                                      # =>This Inner Loop Header: Depth=1
        movq    %r14, %rax
        sarq    $5, %rax
        movq    6(%rbx), %rcx
        movq    14(%rbx), %rdx
        movl    %r14d, %esi
        andl    $31, %esi
        movl    16(%rdx,%rsi,4), %edx
        addq    22(%rbx), %r14
        orl     %edx, 16(%rcx,%rax,4)
        cmpq    $131072, %r14         # imm = 0x20000
        jl      .LBB29_1
}}}
where one can clearly see the multiple register loads inside the inner
loop. This code runs at almost four CPU clock cycles per loop on an Intel
Skylake.
It is easy to see that this code is partially non-strict by running with
the `+RTS -s` command line option and observing that heap use is much
higher than it should be, although not so high that it causes a
significant amount of GC or cost in execution time. The extra execution
time is almost entirely due to the register reloads seen above inside the
inner loop.
'''The Work Around'''
By merely changing the inner loop as follows, the non-strictness goes away
(as seen in the amount of heap used, which drops to a few tens of
Kilobytes from tens of Megabytes):
{{{
                    let nxtc c =
                          if c > top then nxtp (p + 1) else do
                            let w = c `shiftR` 5
                            v <- unsafeRead cmpstsw w
                            unsafeWrite cmpstsw w (v .|. unsafeAt twos (c .&. 31))
                            nxtc (c + p) in nxtc s in
          twos `seq` nxtp 2
}}}
The modified code produces the following STG (massively indented for
display here):
{{{
lvl21_s7Rl [Dmd=]
{ __DEFAULT ->
let-no-escape {
$wnxtc_s7Rm [InlPrag=[0],
Occ=LoopBreaker]
:: GHC.Prim.Int#
-> GHC.Prim.State#
GHC.Prim.RealWorld
-> (# GHC.Prim.State#
GHC.Prim.RealWorld,
Data.Array.Base.STUArray
GHC.Prim.RealWorld
GHC.Types.Int
GHC.Types.Bool #)
[LclId,
Arity=2,
Str=DmdType ,
Unf=OtherCon []] =
sat-only \r srt:SRT:[] [ww3_s7Rn
w3_s7Ro]
case
># [ww3_s7Rn
131071#]
of
sat_s7Rp
{ __DEFAULT ->
case
tagToEnum# [sat_s7Rp]
of
_ [Occ=Dead]
{ GHC.Types.False ->
case
uncheckedIShiftRA# [ww3_s7Rn
5#]
of
i#_s7Rr [Dmd=]
{ __DEFAULT ->
case
readWord32Array# [ipv1_s7Qj
i#_s7Rr
w3_s7Ro]
of
_ [Occ=Dead]
{ (#,#) ipv8_s7Rt [Occ=Once]
ipv9_s7Ru [Occ=Once] ->
case
andI# [ww3_s7Rn
31#]
of
sat_s7Rv
{ __DEFAULT ->
case
indexWord32Array#
[ipv5_s7Qx
sat_s7Rv]
of
wild7_s7Rw
{ __DEFAULT ->
case
or# [ipv9_s7Ru
wild7_s7Rw]
of
sat_s7Rx
{ __DEFAULT ->
case
writeWord32Array# [ipv1_s7Qj
i#_s7Rr
sat_s7Rx
ipv8_s7Rt]
of
s2#1_s7Ry
[OS=OneShot]
{ __DEFAULT ->
case
+#
[ww3_s7Rn
ww2_s7R8]
of
sat_s7Rz
{
__DEFAULT ->
$wnxtc_s7Rm
sat_s7Rz
s2#1_s7Ry;
};
};
};
};
};
};
};
GHC.Types.True ->
$wnxtp1_s7R7
lvl21_s7Rl
w3_s7Ro;
};
};
} in
$wnxtc_s7Rm
x1_s7Ra
ipv6_s7Rf;
};
};
}}}
which is converted to the following initial CMM code:
{{{
       c8o0:
           switch [0 .. 1] (%MO_S_Gt_W64(_s7QO::I64, 131071)) {
               case 0 : goto c8o8;
               case 1 : goto c8o9;
           }
       c8o9:
           _s7Qz::I64 = _s7QM::I64;
           goto c8ni;
       c8o8:
           _s7QS::I64 = %MO_S_Shr_W64(_s7QO::I64, 5);
           I32[(_s7Qj::P64 + 16) + (_s7QS::I64 << 2)] = %MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7Qj::P64 + 16) + (_s7QS::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[(_s7Qx::P64 + 16) + (_s7QO::I64 & 31 << 2)]));
           _s7QO::I64 = _s7QO::I64 + _s7Qz::I64;
           goto c8o0;
}}}
and the following optimized CMM code:
{{{
       c8p9:
           switch [0 .. 1] (%MO_S_Gt_W64(_s7Rn::I64, 131071)) {
               case 0 : goto c8ph;
               case 1 : goto c8pi;
           }
       c8pi:
           _s7R8::I64 = _s7Rl::I64;
           goto c8ou;
       c8ph:
           _s7Rr::I64 = %MO_S_Shr_W64(_s7Rn::I64, 5);
           I32[(_s7Qj::P64 + 16) + (_s7Rr::I64 << 2)] = %MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7Qj::P64 + 16) + (_s7Rr::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[(_s7Qx::P64 + 16) + (_s7Rn::I64 & 31 << 2)]));
           _s7Rn::I64 = _s7Rn::I64 + _s7R8::I64;
           goto c8p9;
}}}
to produce the following almost ideal assembly code (this particular code
doesn't seem to manifest the symptoms of ticket #12798):
{{{
.LBB29_10:                            # %c8ph
                                      #   Parent Loop BB29_7 Depth=1
                                      #     Parent Loop BB29_8 Depth=2
                                      # =>    This Inner Loop Header: Depth=3
        movq    %rsi, %rdx
        sarq    $5, %rdx
        movl    %esi, %edi
        andl    $31, %edi
        movl    16(%rcx,%rdi,4), %edi
        orl     %edi, 16(%r10,%rdx,4)
        addq    %rax, %rsi
        cmpq    $131071, %rsi         # imm = 0x1FFFF
        jle     .LBB29_10
}}}
which one can see has no register loads and is almost ideal as to speed
for this purpose: it runs at about 3.09 CPU clock cycles per loop, whereas
I have seen slightly re-ordered code produced by Clang/Rust/LLVM that runs
at about 3.00 clock cycles.
In order to fix the previous code using primitive Addr# operations, for
which this ticket was opened, one just has to convince the compiler that
it is to be evaluated strictly; although this is not so easy, or one runs
into the "mixed lifted and unlifted types" error message.
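One way to sidestep that is to apply the same reshaping as in the
workaround above to the Addr# kernel of comment 10 itself, moving the
outer-loop call into the inner loop's exit branch. The following is a
hypothetical, self-contained sketch of that shape (my illustration, not
the ticket's attached test code; it assumes the same GHC 8.0-era primop
signatures used in the listings above):
{{{#!hs
{-# LANGUAGE BangPatterns, MagicHash, UnboxedTuples #-}
-- Hypothetical sketch: the comment-10 kernel with the outer-loop call
-- `cullp (p + 2)` moved into `doit`'s exit branch, so `doit` only ever
-- tail-calls and need not be allocated as a closure.
import GHC.Exts
import GHC.IO (IO(..))

cullAll :: IO ()
cullAll = IO $ \s0# ->
  case newPinnedByteArray# 2048# s0# of { (# s1#, marr# #) ->
  case setByteArray# marr# 0# 2048# 0# s1# of { s2# ->       -- zero the buffer
  case unsafeFreezeByteArray# marr# s2# of { (# s3#, arr# #) ->
  case byteArrayContents# arr# of { adr# ->
  let lmt# = plusAddr# adr# 2040# in
  let cullp p@(I# p#) sp# =
        if p > 5 then (# sp#, () #) else
        let doit c# s# =
              case c# `ltAddr#` lmt# of
                0# -> cullp (p + 2) s#   -- continuation moved into the exit branch
                _  ->
                  case readWord8OffAddr# c# 0# s# of { (# s4#, v# #) ->
                  case writeWord8OffAddr# c# 0# (v# `or#` int2Word# 1#) s4# of { s5# ->
                  doit (plusAddr# c# p#) s5# }}
        in doit adr# sp#
  in cullp 1 s3# }}}}

main :: IO ()
main = cullAll >> putStrLn "culled"
}}}
Compiling this with the same dump flags as above lets one check whether
the invariant loads now stay outside the loop.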
However, there is likely a whole wide range of programs where executing
entirely strictly is either not possible or not desired. I don't see why
non-strict boxed code (for Haskell, likely the majority of code) cannot
be just as effectively optimized.
'''In conclusion:''' this is a very serious performance bug that can
cause up to about half again the cost in execution time (a 50% increase),
occurs in many use cases with a typical performance cost of about 30% (for
instance, for highly recursive code using list-based tail calls), and I
believe has a great deal to do with the general perception that (GHC)
Haskell is very much slower than Cee languages (C/C++/Rust, etc.).
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:12
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by GordonBGood:

@@ -12,11 +12,15 @@
- '''Description of test code:''' Essentially, this method involves
- extreme loop unrolling with very imperative code although coded
- functionally; in the case of the following code it means that, recognizing
- that for all odd primes (which they all are other than two), and that all
- word sizes are of an even number of bits, there will be a repeating
- pattern of composite number culls that repeats every word size number of
- bits. Thus for a word size of one eight-bit byte, we can unroll to eight
- composite culls in the body of one loop, with loops cases for the primes
- modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions (b0..b7)
- meaning there are four times eight is 32 loop cases. When there are no
- longer a full eight culls left, the culling reverts to conventional
+ '''Shortest possible test code that clearly shows non-strict code not
+ being optimized, but optimized when made strict:''' Please refer directly
+ to comment 12 https://ghc.haskell.org/trac/ghc/ticket/12808#comment:12,
+
+ '''A version of test code that triggered this ticket:''' Essentially,
+ this method involves extreme loop unrolling with very imperative code
+ although coded functionally; in the case of the following code it means
+ that, recognizing that for all odd primes (which they all are other than
+ two), and that all word sizes are of an even number of bits, there will be
+ a repeating pattern of composite number culls that repeats every word size
+ number of bits. Thus for a word size of one eight-bit byte, we can unroll
+ to eight composite culls in the body of one loop, with loops cases for the
+ primes modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions
+ (b0..b7) meaning there are four times eight is 32 loop cases. When there
+ are no longer a full eight culls left, the culling reverts to conventional

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:13
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 8.0.1 Resolution: | Keywords: JoinPoints Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+-------------------------------------
Changes (by simonpj):

 * keywords:  => JoinPoints
 * cc: maurerl@… (added)

Comment:

Interesting. Looking at comment:12, note the difference between the SLOW version:
{{{
let nxtc c = if c > top then return ()
             else do { ...; nxtc (c+p) }
in do { nxtc ss; nxtp (p + 1) }
}}}
and the FAST version
{{{
let nxtc c = if c > top then nxtp (p + 1)
             else do { ...; nxtc (c+p) }
in nxtc ss
}}}
In SLOW, `nxtc` is represented by a heap-allocated closure, whereas in FAST `nxtc` is a join point, and hence not allocated at all (you can see that from the `let-no-escape` in the STG). See our paper [https://www.microsoft.com/en-us/research/publication/compiling-without-continuations/ Compiling without continuations], and [wiki:SequentCore].

Notice that in SLOW, the call to `nxtc s` is ''followed by'' a call to `nxtp (p+1)`. But in FAST we move that call right into `nxtc` itself, in the return branch of the `if`. That's what makes `nxtc` into a join point. (None of this has anything to do with non-strictness, incidentally.)

This is a rather non-trivial transformation. You clearly think it's a pretty obvious optimisation, but it doesn't look obvious to me. Happily, though, our new Core-with-join-points (described in the paper) should catch this nicely. If we start with SLOW, after a bit we'll get this
{{{
let nxtc c s = if c > top then (# s,c #)
               else case ... of (# s',p #) -> nxtc (c+p) s'
in case nxtc ss s of (# s', r #) -> nxtp (p + 1) s'
}}}
Now if we do float-in, to move the `let nxtc` in to the scrutinee of the case, it becomes a join point, and join-point analysis should find it. After that, the transformations in the paper will turn it into FAST.

The example in comment:10 looks similar:
{{{
case doit adr# sp# of sd# -> cullp (p + 2) sd# in cullp 1 s4#
}}}
Here again, `doit` will become a join point if we float it in. The example in the Description is too big for me to analyse.

I've cc'd Luke Maurer, who is implementing Core with join points; this looks like another good example. (cf #12781)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:14
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
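To see the SLOW-versus-FAST contrast above in a self-contained form, here is a small hypothetical distillation (my illustration, not code from the ticket; `slowSum` and `fastSum` are made-up names): both functions compute the same result, but `slowSum` has the inner loop return to its caller, which then continues the outer loop, while `fastSum` moves the outer-loop call into the inner loop's exit branch, the join-point shape described above:
{{{#!hs
-- SLOW shape: `inner`'s result is consumed by the caller, which then
-- continues the outer loop, so `inner` escapes into a closure.
slowSum :: Int -> Int
slowSum n = outer 2 0 where
  outer p acc
    | p > n     = acc
    | otherwise =
        let inner c acc'
              | c > n     = acc'                 -- exit returns to caller
              | otherwise = inner (c + p) (acc' + c)
        in outer (p + 1) (inner (p * p) acc)     -- caller continues outer loop

-- FAST shape: the continuation `outer (p + 1)` is moved into `inner`'s
-- exit branch, so `inner` only ever tail-calls: the join-point shape.
fastSum :: Int -> Int
fastSum n = outer 2 0 where
  outer p acc
    | p > n     = acc
    | otherwise =
        let inner c acc'
              | c > n     = outer (p + 1) acc'   -- continuation moved here
              | otherwise = inner (c + p) (acc' + c)
        in inner (p * p) acc

main :: IO ()
main = print (slowSum 1000 == fastSum 1000)      -- True; only the shape differs
}}}
Under -ddump-stg one can then check whether the `fastSum` inner loop appears as a `let-no-escape` binding while the `slowSum` one does not.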

#12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 8.0.1 Resolution: | Keywords: JoinPoints Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by GordonBGood):

Replying to [comment:14 simonpj]:

> In SLOW, `nxtc` is represented by a heap-allocated closure, whereas in FAST `nxtc` is a join point, and hence not allocated at all (you can see that from the `let-no-escape` in the STG). See our paper [https://www.microsoft.com/en-us/research/publication/compiling-without-continuations/ Compiling without continuations], and [wiki:SequentCore].
> Notice that in SLOW, the call to `nxtc s` is ''followed by'' a call to `nxtp (p+1)`. But in FAST we move that call right into `nxtc` itself, in the return branch of the `if`. That's what makes `nxtc` into a join point.

Yes, I saw that. I had wondered if the work on join points might help here...

> (None of this has anything to do with non-strictness, incidentally.)

I just wondered if strictness might be a clue, as the FAST version consumes very low heap, whereas the SLOW version takes an amount of heap which is about the calculated amount for numLOOPS times the number of primes culled...

> This is a rather non-trivial transformation. You clearly think it's a pretty obvious optimisation, but it doesn't look obvious to me.

No, I didn't think it was trivial, just that the optimisation gets triggered in the one case; I couldn't see why it couldn't be extended to the other...

> Happily, though, our new Core-with-join-points (described in the paper) should catch this nicely. If we start with SLOW, after a bit we'll get this
> {{{
> let nxtc c s = if c > top then (# s,c #)
>                else case ... of (# s',p #) -> nxtc (c+p) s'
> in case nxtc ss s of (# s', r #) -> nxtp (p + 1) s'
> }}}
> Now if we do float-in, to move the `let nxtc` in to the scrutinee of the case, it becomes a join point, and join-point analysis should find it. After that, the transformations in the paper will turn it into FAST.
> The example in comment:10 looks similar:
> {{{
> case doit adr# sp# of sd# -> cullp (p + 2) sd# in cullp 1 s4#
> }}}
> Here again, `doit` will become a join point if we float it in. The example in the Description is too big for me to analyse.

Don't worry about the big example in the description, as if the small example in comment 10 and this latest example in comment 12 get fixed, I'm pretty sure it will take care of that larger case too.

> I've cc'd Luke Maurer, who is implementing Core with join points; this looks like another good example. (cf #12781)

That is encouraging news that the Join Point Analysis should catch this. Thanks for looking at this, as I think it important in many cases beyond this.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:15
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#12808: For closures, Loop Invariant Code Flow not lifted outside the loop... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 8.0.1 Resolution: | Keywords: JoinPoints Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by GordonBGood: @@ -12,3 +12,5 @@ - '''Shortest possible test code that clearly shows non-strict code not - being optimized, but optimized when made strict:''' Please refer directly - to comment 12https://ghc.haskell.org/trac/ghc/ticket/12808#comment:12, + '''Shortest possible test code that clearly shows closures not being + optimized, but optimized when unified by a "join point":''' Please refer + directly to comment + 12https://ghc.haskell.org/trac/ghc/ticket/12808#comment:12 and follow-on + comments. New description: '''Background:''' I've been intrigued investigating whether GHC can produce code "as fast as Cee (C/C++/Rust/etc.)" by-any-means-possible, and have been using the very tight inner composite culling loops (purely integer operations) of a basic Sieve of Eratosthenes implementation as a test vehicle. '''Synopsis:''' This is a follow-on of the work leading to finding the efficiency problem described in ticket #12798, but involves pushing the speed even further as per the method described for "primesieve" as per [http://primesieve.org/] in the "Highly optimized inner loop" section. '''Shortest possible test code that clearly shows closures not being optimized, but optimized when unified by a "join point":''' Please refer directly to comment 12https://ghc.haskell.org/trac/ghc/ticket/12808#comment:12 and follow-on comments. '''A version of test code that triggered this ticket:''' Essentially, this method involves extreme loop unrolling with very imperative code although coded functionally; in the case of the following code it means that, recognizing that for all odd primes (which they all are other than two), and that all word sizes are of an even number of bits, there will be a repeating pattern of composite number culls that repeats every word size number of bits. Thus for a word size of one eight-bit byte, we can unroll to eight composite culls in the body of one loop, with loops cases for the primes modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions (b0..b7) meaning there are four times eight is 32 loop cases. When there are no longer a full eight culls left, the culling reverts to conventional single-cull-per-loop as per the test program of ticket #12798. To do this using GHC we need pointer arithmetic, and the only way to implement pointer arithmetic in GHC is to use the Addr# primitive. GHC/Haskell has one other slight overhead over Cee languages in that we need to move the culling array to a pinned array to avoid having it moved while the culling is going on and then move it back when finished but this takes a negligible amount of time (one percent or so) as compared to the culling. As usual for test programs, the culling operations are repeated in a loop for a number of times to give more accurate timing not influenced by execution not related to the culling. 
All of this is included in the following code (truncated as to loop cases for inclusion here):

{{{#!hs
-- EfficiencyBug.hs -- showing that there is a register loop invariant bug
--                     in generation of assembler code...
-- LLVM shows the bug clearer since the code is generally a little faster...

{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O2 -rtsopts -keep-s-files #-} -- or -O2 -fllvm

import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )
import GHC.Exts

numLOOPS = 10000 :: Int

-- Uses a very simple Sieve of Eratosthenes for fixed 2 ^ 18 range
-- (so one L1 cache size) to prove it.

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

soep1 :: () -> [Word32]
soep1() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
  bufb = runSTUArray $ do
    let bfBts = (256 * 1024) `div` 2 -- to 2^18 + 2 is 128 KBits = 16 KBytes
    bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool)
    cullb bf
  cullb bf@(STUArray l u n marr#) = ST $ \s0# ->
    case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) ->
    let loop t mr# s0# = -- cull a number of times to test timing
          if t <= 0 then (# s0#, STUArray l u n mr# #) else
          case getSizeofMutableByteArray# mr# s0# of { (# s1#, n# #) ->
          case newPinnedByteArray# n# s1# of { (# s2#, marr'# #) ->
          case copyMutableByteArray# mr# 0# marr'# 0# n# s2# of { s3# ->
          case unsafeFreezeByteArray# marr'# s3# of { (# s4#, arr# #) -> -- must do this
          case byteArrayContents# arr# of { adr# -> -- to obtain the addr# here
          let cullp i@(I# i#) sp# =
                let !p@(I# p#) = i + i + 3 in
                let !s@(I# s#) = (p * p - 3) `div` 2 in
                if s >= n then
                  case copyMutableByteArray# marr'# 0# mr# 0# n# sp# of
                    so# -> (# so#, mr# #)
                else
                  let !(UArray _ _ _ tarr#) = twos in
                  case readWord64Array# marr# (i# `uncheckedIShiftRL#` 6#) sp# of { (# sp0#, v0# #) ->
                  case (v0# `and#` ((int2Word# 1#) `uncheckedShiftL#` (i# `andI#` 63#)))
                         `eqWord#` (int2Word# 0#) of
                    0# -> cullp (i + 1) sp0# -- not prime
                    _ -> -- is prime
                      -- most program execution time spent in the following tight loops.
                      -- the following code implements extreme loop unrolling...
                      let !pi@(I# pi#) = fromIntegral p in
                      let !sw@(I# sw#) = s `shiftR` 3 in
                      let !sb@(I# sb#) = s .&. 7 in
                      let p1 = sb + pi in let !(I# r1#) = p1 `shiftR` 3 in
                      let p2 = p1 + pi in let !(I# r2#) = p2 `shiftR` 3 in
                      let p3 = p2 + pi in let !(I# r3#) = p3 `shiftR` 3 in
                      let p4 = p3 + pi in let !(I# r4#) = p4 `shiftR` 3 in
                      let p5 = p4 + pi in let !(I# r5#) = p5 `shiftR` 3 in
                      let p6 = p5 + pi in let !(I# r6#) = p6 `shiftR` 3 in
                      let p7 = p6 + pi in let !(I# r7#) = p7 `shiftR` 3 in
                      let !lmt@(I# lmt#) = (fromIntegral n `shiftR` 3) - p7 in
                      let !lmt1# = plusAddr# adr# lmt# in
                      let !strt# = plusAddr# adr# sw# in
                      let !(I# n#) = n in
                      let (# !so#, !sco# #) =
                            case ((((p - 1) `div` 2) .&. 3) `shiftL` 3) + sb of {
                              0 ->
                                let cull c# sp# =
                                      case c# `ltAddr#` lmt1# of
                                        0# -> (# c#, sp# #)
                                        _ ->
                                          case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) ->
                                          case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 1#)) sp0# of { sp1# ->
                                          case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) ->
                                          case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 2#)) sp2# of { sp3# ->
                                          case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) ->
                                          case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 4#)) sp4# of { sp5# ->
                                          case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) ->
                                          case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 8#)) sp6# of { sp7# ->
                                          case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) ->
                                          case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 16#)) sp8# of { sp9# ->
                                          case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) ->
                                          case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 32#)) sp10# of { sp11# ->
                                          case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) ->
                                          case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 64#)) sp12# of { sp13# ->
                                          case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) ->
                                          case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 128#)) sp14# of { sp15# ->
                                          cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}}
                                in cull strt# sp0#;
                              1 ->
                                let cull c# sp# =
                                      case c# `ltAddr#` lmt1# of
                                        0# -> (# c#, sp# #)
                                        _ ->
                                          case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) ->
                                          case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 2#)) sp0# of { sp1# ->
                                          case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) ->
                                          case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 4#)) sp2# of { sp3# ->
                                          case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) ->
                                          case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 8#)) sp4# of { sp5# ->
                                          case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) ->
                                          case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 16#)) sp6# of { sp7# ->
                                          case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) ->
                                          case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 32#)) sp8# of { sp9# ->
                                          case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) ->
                                          case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 64#)) sp10# of { sp11# ->
                                          case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) ->
                                          case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 128#)) sp12# of { sp13# ->
                                          case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) ->
                                          case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 1#)) sp14# of { sp15# ->
                                          cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}}
                                in cull strt# sp0#;
                              -- and so on for 30 more cases...
                              _ -> (# strt#, sp0# #) {- normally never taken case, all cases covered -} }
                      in
                      let !ns# = ((minusAddr# so# adr#) `uncheckedIShiftL#` 3#) +# sb# in
                      -- extreme loop unrolling ends here;
                      -- remaining primes culled conventionally...
                      let cull j# sc# = -- very tight inner loop
                            case j# <# n# of
                              0# -> cullp (i + 1) sc#
                              _ ->
                                let i# = j# `andI#` 31# in
                                let !sh# = indexWord32Array# tarr# i# in -- (1 `shiftL` (j .&. 31))
                                let w# = j# `uncheckedIShiftRL#` 5# in
                                case readWord32Array# marr'# w# sc# of { (# sc0#, ov# #) ->
                                case writeWord32Array# marr'# w# (ov# `or#` sh#) sc0# of { sc1# ->
                                cull (j# +# pi#) sc1# }}
                      in cull ns# sp0# }
          in case cullp 0 s4# of
               (# sp#, mrp# #) -> loop (t - 1) mrp# sp# }}}}}
    in loop numLOOPS marr# s1# }

main = print $ length $ soep1()
}}}

'''The problem:''' The problem is in the innermost loop of the cases; for case "0" the following assembly code (using -fllvm) is produced:

{{{
seGU_info$def:                          # BB#0: # %cgRL
        cmpq    %r14, 70(%rbx)
        jbe     .LBB35_1
        .align  16, 0x90
.LBB35_2:                               # %cgRJ
                                        # =>This Inner Loop Header: Depth=1
        movq    14(%rbx), %rcx
        movq    22(%rbx), %rdx
        movq    30(%rbx), %rsi
        movq    38(%rbx), %rdi
        movq    46(%rbx), %r8
        movq    54(%rbx), %r9
        movq    62(%rbx), %r10
        movq    6(%rbx), %rax
        addq    %r14, %rax
        orb     $1, (%r14)
        orb     $2, (%rcx,%r14)
        orb     $4, (%rdx,%r14)
        orb     $8, (%rsi,%r14)
        orb     $16, (%rdi,%r14)
        orb     $32, (%r8,%r14)
        orb     $64, (%r9,%r14)
        orb     $-128, (%r10,%r14)
        cmpq    70(%rbx), %rax
        movq    %rax, %r14
        jb      .LBB35_2
        jmp     .LBB35_3
.LBB35_1:
        movq    %r14, %rax
.LBB35_3:                               # %cgRK
        movq    (%rbp), %rcx
        movq    %rax, %rbx
        rex64 jmpq      *%rcx           # TAILCALL
}}}

One can readily see that the compiler is not lifting the loop-invariant register initialization outside the inner loop, so every iteration performs eight memory loads that are not necessary.

'''Desired results:''' The desired assembly code is something like the following, which is similar to what is produced by Cee (C/C++/Rust/etc.):

{{{
seGU_info$def:                          # BB#0: # %cgRL
        movq    14(%rbx), %rcx
        movq    22(%rbx), %rdx
        movq    30(%rbx), %rsi
        movq    38(%rbx), %rdi
        movq    46(%rbx), %r8
        movq    54(%rbx), %r9
        movq    62(%rbx), %r10
        movq    6(%rbx), %rax
        movq    70(%rbx), %rbx
        cmpq    %r14, %rbx              # rbx clobbered here, but old value
        jbe     .LBB35_1                # never used again until replaced after loop
        .align  16, 0x90
.LBB35_2:                               # %cgRJ
                                        # =>This Inner Loop Header: Depth=1
        orb     $1, (%r14)
        orb     $2, (%rcx,%r14)
        orb     $4, (%rdx,%r14)
        orb     $8, (%rsi,%r14)
        orb     $16, (%rdi,%r14)
        orb     $32, (%r8,%r14)
        orb     $64, (%r9,%r14)
        orb     $-128, (%r10,%r14)
        addq    %rax, %r14
        cmpq    %rbx, %r14
        jb      .LBB35_2
        jmp     .LBB35_3
.LBB35_1:
        movq    %r14, %rax
.LBB35_3:                               # %cgRK
        movq    (%rbp), %rcx
        movq    %rax, %rbx              # rbx clobbered here anyway
        rex64 jmpq      *%rcx           # TAILCALL
}}}

'''Full testing:''' The actual unrolled loop code including all the case loops is too long to post here, but to verify that the result is correct (23000) and to verify the performance, the full actual file is attached here. Due to the magic of modern CPU instruction fusion and Out-Of-Order (OoO) execution, the code is not as slow as the increased instruction count would suggest; while it is about twice as fast as culling conventionally (Intel Skylake), it is still about half again as slow as Cee. On an Intel Skylake i5-6500 (running at 3.5 GHz for single threading), this takes about one second, about two seconds culled conventionally, but only about 0.6 seconds for Rust/LLVM (with the assembly code output essentially identical to the "desired" code).

'''Other back ends and targets:''' Although the code generated by the native NCG has other problems (not moving the loop test to the end of the loop to avoid one jump, and not combining the read/modify/store sequence into the single available read-modify-write instruction), it exhibits the same failure to lift the loop-invariant register initialization.
Although this code is x86_64, the problem also applies to x86: that architecture doesn't have enough registers to do this in one loop, so it needs to be split into two loops culling only four composites per loop, but there is still a significant gain in speed. Although not tested, it probably also applies to other targets such as ARM (which has many general-purpose registers).

'''Conclusion:''' The use of Addr# primitives is probably not a frequent use case, but as shown here, when one does need them they should be efficient. I considered that GHC might intentionally limit the performance of these unsafe primitives to discourage their use unless absolutely necessary (as in marshalling), as C# does for unsafe pointers, but surely GHC would not do that, as the target programmers are different.

'''If this and ticket #12798 were fixed, for this use case the GHC code would be within a percent or two of the performance of Cee.'''

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:16
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#12808: For closures, Loop Invariant Code Flow related to captured free values not lifted outside the loop...
-------------------------------------+-------------------------------------
        Reporter:  GordonBGood       |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  8.0.1
      Resolution:                    |             Keywords:  JoinPoints
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |   Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:17
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#12808: For closures, Loop Invariant Code Flow related to captured free values not lifted outside the loop...
-------------------------------------+-------------------------------------
        Reporter:  GordonBGood       |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  8.0.1
      Resolution:                    |             Keywords:  JoinPoints
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |   Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by lukemaurer):

Confirmed—with join points, heap allocation drops from 23M to 70K. Looks like the simplifier is doing exactly what we want, and //all// the loops wind up as join points.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:18
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
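For readers following the thread: a join point is a local function that is only ever tail-called, which the compiler can turn into a plain labelled block that is jumped to, allocating no closure. A minimal sketch of the shape involved (my illustration, not code from this ticket):

{{{#!hs
{-# LANGUAGE BangPatterns #-}

-- 'go' is only ever tail-called, so it can be compiled as a join point:
-- a jump target that reads its free variable 'n' directly, with no
-- heap-allocated closure entered on each iteration.
sumTo :: Int -> Int
sumTo n = go 0 0
  where
    go !acc !i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 1000000)
}}}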

#12808: For closures, Loop Invariant Code Flow related to captured free values not lifted outside the loop...
-------------------------------------+-------------------------------------
        Reporter:  GordonBGood       |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  8.0.1
      Resolution:                    |             Keywords:  JoinPoints
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |   Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by simonpj):

> As I have stated, I believe there are many use cases, even to the common
> worker/wrapper pattern that seeks to reduce the amount of parameter
> passing by enclosing a recursive worker closure inside a wrapper; any
> gains from this pattern could be cancelled and more if the wrapped
> enclosure is less efficient.

Let me say what I ''think'' you are saying. Consider
{{{
f x p q = let g y = ...g y'...x...
          in (g p, g q)
}}}
Left as-is we allocate a closure for `g` every time we call `f`. But instead we could lambda-lift `g`:
{{{
g2 x y = ...g2 x y'...x...

f x p q = (g2 x p, g2 x q)
}}}
Now we don't allocate a closure for `g`. That is good. Is this what you mean?

We don't want to do this in general, early in optimisation, because we get huge benefits from being able to "see" the binding site of `g`'s free variable `x`. But these benefits are over when it comes to code generation. So we have experimented with so-called "late lambda lifting" (LLF). There's a whole wiki page about it: [wiki:LateLamLift]. It can be a real win.

One obstacle to LLF is, ironically, that it can destroy join points (see the wiki page). A second benefit of Luke's new join-point work is that it becomes much easier to ensure that LLF doesn't destroy join points, and thus renders it much more practical. I think Luke will turn his attention to it once join points are solidly in.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:19
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
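To illustrate the obstacle mentioned above, here is a hypothetical sketch of mine (not taken from the [wiki:LateLamLift] page) of how lifting can lose a join point:

{{{#!hs
-- 'j' is tail-called from both branches, so it can become a join point:
-- compiled as a local jump with no closure and no call/return overhead.
f :: Int -> Int -> Int
f x y =
  let j z = z + x            -- join point candidate
  in if y > 0 then j y else j (negate y)

-- Lambda-lifting 'j' to the top level makes 'x' an explicit argument,
-- but the two uses are now genuine function calls rather than local
-- jumps; the join point has been destroyed.
jLifted :: Int -> Int -> Int
jLifted x z = z + x

fLifted :: Int -> Int -> Int
fLifted x y = if y > 0 then jLifted x y else jLifted x (negate y)

main :: IO ()
main = print (f 10 3, fLifted 10 3) -- same results either way
}}}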

#12808: For closures, Loop Invariant Code Flow related to captured free values not lifted outside the loop...
-------------------------------------+-------------------------------------
        Reporter:  GordonBGood       |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  8.0.1
      Resolution:                    |             Keywords:  JoinPoints
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |   Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by GordonBGood):

Replying to [comment:19 simonpj]:

> > As I have stated, I believe there are many use cases, even to the
> > common worker/wrapper pattern that seeks to reduce the amount of
> > parameter passing by enclosing a recursive worker closure inside a
> > wrapper; any gains from this pattern could be cancelled and more if
> > the wrapped enclosure is less efficient.
>
> Let me say what I ''think'' you are saying. Consider
> {{{
> f x p q = let g y = ...g y'...x...
>           in (g p, g q)
> }}}
> Left as-is we allocate a closure for `g` every time we call `f`. But
> instead we could lambda-lift `g`:
> {{{
> g2 x y = ...g2 x y'...x...
>
> f x p q = (g2 x p, g2 x q)
> }}}
> Now we don't allocate a closure for `g`. That is good. Is this what you
> mean?
>
> We don't want to do this in general, early in optimisation, because we
> get huge benefits from being able to "see" the binding site of `g`'s
> free variable `x`. But these benefits are over when it comes to code
> generation. So we have experimented with so-called "late lambda lifting"
> (LLF). There's a whole wiki page about it: [wiki:LateLamLift]. It can be
> a real win.
>
> One obstacle to LLF is, ironically, that it can destroy join points (see
> the wiki page). A second benefit of Luke's new join-point work is that
> it becomes much easier to ensure that LLF doesn't destroy join points,
> and thus renders it much more practical. I think Luke will turn his
> attention to it once join points are solidly in.

Yes, Simon, I manually lifted the closure by turning all the otherwise free variables into arguments to the function; although I didn't actually move the code to the top level, it could have been, as it is no longer a closure capturing free variables but rather just a free function.

However, my point in doing so was not to show that even this early manual lambda lifting is effective, but that the code generated inside the function became so much more efficient because the compiler could now see the Loop Invariant Code Flow (LICF) and lift the register loading outside of the loop; to my mind the compiler should have done this whether the code was a closure or not.

As the article on lambda lifting says, lifting at an early stage can prevent some optimizations in some cases, and there will certainly be some run-time cost here in passing all of those extra arguments each time the function is called; however, since the recursive calls are turned into a loop internal to the function by tail-call optimization, the function itself is entered only a few times, so the impact is negligible. There may be other negative effects of very early lambda lifting, but again they are negligible compared to the gains in efficiency of the internal loop from proper LICF lifting.

So the question I pose is: "Why isn't LICF lifting applied to the code internal to closures when it is applied to non-closures?" Your Lambda Lifting and Join Point Analysis can reduce this problem by eliminating closures, but the problem is still there for cases where the closures can't be eliminated and/or shouldn't be lifted. I sometimes wonder whether this is what makes LL and JPA appear to be so effective: this problem makes the code that doesn't use JPA/LL much slower than it would otherwise be, so that if this problem were fixed, the gains made by LL/JPA in eliminating some/many closures would likely not be so great.

I brought up the worker/wrapper pattern because its whole point is to reduce the number of passed parameters that need to be tail-call optimized away, but the "worker" must then be a closure; given the problem of this issue, in many cases the resulting worker/wrapper will be slower than if we hadn't factored out the closure "worker". In the sieve code of this thread, the "nxtc" closure as originally written is essentially a "worker" for the "nxtp" wrapper function, and manually lifting it as I did means it is no longer a "worker". I should have gotten slightly worse performance by doing this, but instead got much better performance, because the compiler then did the LICF optimization on the non-closure, where it currently doesn't seem to be able to do it on a closure.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12808#comment:20
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
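To make the manual lifting described in that last comment concrete, here is a small hypothetical example in the spirit of the "nxtc"/"nxtp" code (the names `sumStride` and `workerL` are mine, not from the ticket): the worker's free variables become explicit arguments, so it captures nothing, and because tail-call optimization turns the recursion into an internal loop the extra arguments are passed only once per outer call:

{{{#!hs
{-# LANGUAGE BangPatterns #-}

-- Closure version: 'worker' captures 'step' and 'lim' as free variables.
sumStride :: Int -> Int -> Int -> Int
sumStride step lim start = worker start 0
  where
    worker !i !acc
      | i > lim   = acc
      | otherwise = worker (i + step) (acc + i)

-- Manually lifted version: the free variables are passed explicitly, so
-- 'workerL' captures nothing and could live at the top level.
sumStrideLifted :: Int -> Int -> Int -> Int
sumStrideLifted step lim start = workerL step lim start 0

workerL :: Int -> Int -> Int -> Int -> Int
workerL !step !lim !i !acc
  | i > lim   = acc
  | otherwise = workerL step lim (i + step) (acc + i)

main :: IO ()
main = print (sumStride 2 10 1 == sumStrideLifted 2 10 1) -- True
}}}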