[GHC] #9246: GHC generates poor code for repeated uses of min/max

#9246: GHC generates poor code for repeated uses of min/max ------------------------------+-------------------------------------------- Reporter: arotenberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Keywords: | Operating System: Windows Architecture: x86_64 | Type of failure: Runtime performance bug (amd64) | Test Case: Difficulty: Unknown | Blocking: Blocked By: | Related Tickets: #6135 | ------------------------------+-------------------------------------------- Consider the following module, which intends to implement a [http://tavianator.com/2011/05/fast-branchless-raybounding-box- intersections/ branchless ray-AABB intersection test]: {{{ module SimpleGeom where data Vec3 = Vec3 {-# UNPACK #-} !Double {-# UNPACK #-} !Double {-# UNPACK #-} !Double data Ray = Ray !Vec3 !Vec3 !Vec3 data AABB = AABB !Vec3 !Vec3 testRayAABBIntersection :: Ray -> AABB -> Bool testRayAABBIntersection (Ray (Vec3 ox oy oz) _ (Vec3 invDx invDy invDz)) (AABB (Vec3 minX minY minZ) (Vec3 maxX maxY maxZ)) = let tx1 = (minX - ox) * invDx tx2 = (maxX - ox) * invDx ty1 = (minY - oy) * invDy ty2 = (maxY - oy) * invDy tz1 = (minZ - oz) * invDz tz2 = (maxZ - oz) * invDz tmin = min tx1 tx2 `max` min ty1 ty2 `max` min tz1 tz2 tmax = max tx1 tx2 `min` max ty1 ty2 `min` max tz1 tz2 in tmax >= max 0 tmin }}} Everything is strict primitive operations, so GHC should generate very simple, fast code, right? But upon compiling with {{{ghc -O -ddump-simpl -ddump-to-file SimpleGeom}}}, I found a mess of nested local lambdas and similar performance-killing expression forms. (See the attached output file.) There are two possible issues I can see here. 1. GHC is trying to expand out all of the branches recursively (I would presume via case-of-cases transformation), which is a bad idea in this instance compared to just performing the cases sequentially and storing their results. 1. GHC is generating branches for floating-point min/max. Instruction sets like SSE2 include non-branching floating-point min/max instructions, which is exactly what this algorithm was designed to exploit, but GHC does not generate code that could take advantage of these instructions. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Changes (by carter): * type: bug => feature request * milestone: => 7.10.1 Comment: So if you take a look at the haskell level primops that GHC provides http://www.haskell.org/ghc/docs/latest/html/libraries/ghc-prim-0.3.1.0 /GHC-Prim.html you'll notice that theres no primops for min or max on the underlying unlifted types Double# or Float# (or Int#). So if you need a branchless inner loop today, you'll either need to write a teeny bit of C you'll call out to, encode the logic as some bit fiddling on the floats, OR do something like describe the algorithm programmatically in a Haskell DSL, and at runtime use LLVM-General to generate the code you want http://hackage.haskell.org/package/llvm- general (i'm told some video codecs actually use LLVM for runtime code gen!) Adding these to ghc is a pretty reasonable feature request, though such wouldn't land till 7.10. If you're interested in doing some of the leg work , i'm happy to try to guide you through the process! (If not, thats fine too, its a really concrete task i'll give to someone who wants to do their first ghc patch, ) zooming out, 1. do you have any benchmarks (say, using criterion) for this code? 2. how does eg using -fllvm as your backend fair vs -fasm for this code in such a benchmark -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Comment (by arotenberg): I tried using criterion to profile different versions of this function, but I couldn't get the results to not be noisy and "microbenchmarky". The way I originally found this issue was by profiling the raytracer program I'm writing, which fingered this function as both the single largest time sink and the largest allocator. I tried replacing the definition with this variant: {{{ testRayAABBIntersection :: Ray -> AABB -> Bool testRayAABBIntersection (Ray (Vec3 (D# ox) (D# oy) (D# oz)) _ (Vec3 (D# invDx) (D# invDy) (D# invDz))) (AABB (Vec3 (D# minX) (D# minY) (D# minZ)) (Vec3 (D# maxX) (D# maxY) (D# maxZ))) = let tx1 = minusTimes## minX ox invDx tx2 = minusTimes## maxX ox invDx ty1 = minusTimes## minY oy invDy ty2 = minusTimes## maxY oy invDy tz1 = minusTimes## minZ oz invDz tz2 = minusTimes## maxZ oz invDz tmin = min## tx1 tx2 `max##` min## ty1 ty2 `max##` min## tz1 tz2 tmax = max## tx1 tx2 `min##` max## ty1 ty2 `min##` max## tz1 tz2 in isTrue# (tmax >=## max## 0.0## tmin) {-# NOINLINE minusTimes## #-} minusTimes## :: Double# -> Double# -> Double# -> Double# minusTimes## minA oa invDa = (minA -## oa) *## invDa {-# NOINLINE min## #-} {-# NOINLINE max## #-} min##, max## :: Double# -> Double# -> Double# max## x y = if isTrue# (x <=## y) then y else x min## x y = if isTrue# (x <=## y) then x else y }}} This gets the function's heap allocation down to the expected zero bytes, but at the cost of actually making the program slower (both with and without profiling)! I haven't tried comparing results with the LLVM backend yet. I might look into it later today. It would be nice to see non-branching min/max implemented for the primitive numeric types where available. Min/max for integer types can probably be implemented efficiently on many platforms using conditional move instructions. I'm not particularly interested in implementing it myself, but I'm happy with having it as a feature request. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Comment (by carter): I think your foo## operations can be safely marked INLINE, rather than NOINLINE. Was there a reason for the NOINLINE? the code loos pretty darn pure, so i don't see how NOINLINE would be needed, and that lack of inlining would explain why the branchless stuff was slower -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Comment (by arotenberg): On the contrary, replacing the NOINLINEs with INLINEs results in essentially the original Core, local lambdas and all. I explicitly put the NOINLINEs in because that was the only way I could find to force GHC to not expand the branches and add the unnecessary lambdas and lazy lets. Perhaps this issue should be split into two tickets - something like "Add branchless min/max primops" and "Min/max-like functions cause unnecessary closure creation"? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Changes (by carter): * type: feature request => bug * milestone: 7.10.1 => Comment: yes, lets do that. I'll switch this ticket to being the bug one and open a new one for the feature request element and we can back and forth about generating the core etc here. What were the ghc flags and optimization levels etc you were using to check the core? I'd like to make sure i can reproduce your findings before I try out my own hacking to help you out with near term work around over the next week (as my own work permits, though helping folks with perf engineering in haskell DOES overlap with my work a teeny bit) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Comment (by arotenberg): You should be able to reproduce the issue by using the versions and flags I listed in the original description with a fresh GHC 7.8.2 install. I'm in no rush to get this fixed since the program I'm working on is just a hobby project. I just ran into the issue and figured I'd report it. I mentioned "min/max-like functions" in my previous comment because it's easy to contrive other functions that cause similar issues. Try compiling this module with `ghc -O -ddump-simpl -ddump-to-file UglyBranching.hs` and look at the Core file it generates. {{{ module UglyBranching where foo :: Int -> Int -> Int -> Int -> Int foo a b c d = (((a `bar` b) `bar` (c `bar` d)) `bar` ((a `bar` c) `bar` (b `bar` d))) `bar` (((b `bar` a) `bar` (d `bar` c)) `bar` ((c `bar` a) `bar` (d `bar` b))) bar :: Int -> Int -> Int bar m n = if m + n > 5 then m else n }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Comment (by simonpj): Indeed! What Core would you ''like'' to see generates for examples like `UglyBranching`? The underlying difficulty is that the case-of-case transformation is utterly crucial for optimising Haskell programs, and it's hard to predict when it'll be unproductive, or even counter-productive. I'd welcome ideas. case-of-case is described in "A transformation based optimiser for Haskell", and (in passing) in many other papers. Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Comment (by arotenberg): Having now read "A transformation based optimiser for Haskell", I learned about the `let-no-escape` optimization, which I was not previously aware of. Looking at the output of `-ddump-stg` on UglyBranching, all of the local lambdas are actually bound by `let-no-escape`, which makes the whole question irrelevant. Well, now I feel silly. I guess if there's something to be learned here, it's that `let-no-escape` could be publicized better! Everything I've seen on the internet up until now was all "look at the Core, if you see `let`, that's a Bad Thing and you might want to Do Something About It!" (Quoth the GHC docs: "If profiling has pointed the finger at particular functions, look at their Core code. lets are bad, cases are good, ... nested lambdas are bad, ...") -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Comment (by simonpj): Yes, let-no-escape should be better documented. And I feel bad that although it's terribly important for performance, it's entirely implicit in Core, and it's not hard for a transformation to inadvertently lose the LNE property. I'd like it to be more explicit, somehow. Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Comment (by arotenberg): I don't have any great ideas on the LNE issue, other than maybe add a line in the GHC user docs about it somewhere. carter has opened #9251 for the min/max primops, so I guess this issue can be closed. Thanks for your help! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max --------------------------------------------+------------------------------ Reporter: arotenberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Windows | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: #6135 --------------------------------------------+------------------------------ Comment (by carter): @arontenberg, need not close it per se, I think @simonpj raise a good point (which perhaps can be spun off into its own ticket?) I notice Jan has a blog post about LNE http://lambda.jstolarek.com/2013/10/let-no-escape/ is it kinda like a higher order phi function? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9246: GHC generates poor code for repeated uses of min/max -------------------------------------+------------------------------------- Reporter: arotenberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: x86_64 (amd64) Unknown/Multiple | Difficulty: Unknown Type of failure: Runtime | Blocked By: performance bug | Related Tickets: #6135 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by gintas): * os: Windows => Unknown/Multiple -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9246#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC