[GHC] #16004: Vector performance regression in GHC 8.6

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.3 Component: Compiler | Version: 8.6.2 Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Runtime Unknown/Multiple | performance bug Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- Hello. With the following code, I can observe a performance regression between ghc 8.4 and 8.6: {{{#!haskell {-# LANGUAGE ScopedTypeVariables #-} module Main where import qualified Data.Vector.Unboxed.Mutable as Vector import qualified Data.Vector.Unboxed as VectorU import Data.Foldable (for_) main :: IO () main = do let n = 1000 let vUnmutable :: VectorU.Vector Double = VectorU.generate (n * n) (\i -> fromIntegral i) v :: Vector.IOVector Double <- VectorU.unsafeThaw vUnmutable for_ [0..(n - 1)] $ \k -> do for_ [0..(n - 1)] $ \i -> do for_ [0..(n - 1)] $ \j -> do a <- Vector.unsafeRead v (i * n + k) b <- Vector.unsafeRead v (k * n + j) c <- Vector.unsafeRead v (i * n + j) Vector.unsafeWrite v (i * n + j) (min (a + b) c) }}} Built with `-O2` and with / without `-fllvm`. I'm using `vector-0.12.0.1`. Here are the timing results: GHC 8.2.2 no llvm: 1.7s llvm: 1.0s GHC 8.4.4 no llvm: 1.6s llvm: 0.9s GHC 8.6.2 no llvm: 4.8s llvm: 4.3s I'm using the following bash + nix script to gather theses timings: {{{#!bash nix-shell -p 'haskell.packages.ghc822.ghcWithPackages(p: [p.vector])' --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce- recomp; time ./FloydBench" nix-shell -p 'haskell.packages.ghc822.ghcWithPackages(p: [p.vector])' -p llvm_39 --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce-recomp -fllvm; time ./FloydBench" nix-shell -p 'haskell.packages.ghc844.ghcWithPackages(p: [p.vector])' --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce- recomp; time ./FloydBench" nix-shell -p 'haskell.packages.ghc844.ghcWithPackages(p: [p.vector])' -p llvm_5 --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce-recomp -fllvm; time ./FloydBench" nix-shell -p 'haskell.packages.ghc862.ghcWithPackages(p: [p.vector])' --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce- recomp; time ./FloydBench" nix-shell -p 'haskell.packages.ghc862.ghcWithPackages(p: [p.vector])' -p llvm_6 --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce-recomp -fllvm; time ./FloydBench" }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.3 Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by alpmestan): * cc: alpmestan (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.3 Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by carter): Ouch! Thanks for reporting this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.3 Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by RyanGlScott): * cc: simonpj (added) Comment: This regression was introduced in commit 3d38e8284b7382844f9862e8d8afbae9c7248b09 (`Do not unpack class dictionaries with INLINABLE`). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.3 Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Reproduced with 8.4.3 and 8.6.1 For the original code using -fno-full-laziness performance is almost the same for 8.4 and 8.6, and what little difference there is probably comes from using a different branch order at the Cmm level. {{{ $ ~/bench-exe.exe ./test-8.6-nofloat.exe -- ./test-8.4.exe benchmarking execute: ./test-8.6-nofloat.exe time 1.632 s (1.352 s .. 1.859 s) 0.997 R² (0.988 R² .. 1.000 R²) mean 1.629 s (1.600 s .. 1.658 s) std dev 49.03 ms (0.0 s .. 49.85 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking execute: ./test-8.4.exe time 1.646 s (1.493 s .. 1.863 s) 0.998 R² (0.994 R² .. NaN R²) mean 1.597 s (1.560 s .. 1.622 s) std dev 37.88 ms (0.0 s .. 43.65 ms) variance introduced by outliers: 19% (moderately inflated) }}} The difference between full-laziness not seems to be that with full- laziness we float out the creation of the [0..n] list, instead of transforming the code into a simple loop as intended. So we end up with this inner loop that passes around the list explicitly. I assume deforestation fails here? {{{ joinrec { go2_s7mc go2_s7mc ds3_X6W3 eta2_X3r = case ds3_X6W3 of { [] -> jump exit2_XF eta2_X3r; : y3_X6Yn ys2_X6Yq -> case readDoubleArray# (ipv1_a6zy `cast` Co:50) (+# (*# x1_a5F9 1000#) x_a69v) (eta2_X3r `cast` Co:14) of { (# ipv4_X6am, ipv5_X6ao #) -> case y3_X6Yn of { I# y4_X6c8 -> case writeDoubleArray# (ipv1_a6zy `cast` Co:50) (+# (*# x1_a5F9 1000#) y4_X6c8) ipv5_X6ao ipv4_X6am of s'#1_X6b3 { __DEFAULT -> jump go2_s7mc ys2_X6Yq (s'#1_X6b3 `cast` Co:13) } } } }; } }}} Not an export in the deforestation machinery so leaving that for someone else. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.3 Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
with full-laziness we float out the creation of the [0..n] list,
This particular thing is quite delicate. Consider {{{ f xs = map (\x -> foo x [1..100]) xs }}} Should we float that `[1..100]` out to top level, and share it among all the calls that `map` makes? Or should we reconstruct it on every call. In the latter case we might be able to fuse it with the consumer (perhaps we inline `foo`). But even if we fuse away the list we might still allocate heap objects for `I# 1`, `I# 2`, ... `I# 100`, for each element of `xs` rather sharing those values between all those calls. Or, perhaps, we get to fuse away those `I#` boxes too! It all depends on `foo`. I don't really have a good answer here. Sometimes full laziness is a win, sometimes not. There is more discussion in #7206. I'd love someone to work on this a bit. Somehow we should do better than we are doing. PS: I have no idea why the particular commit that made this test worse did make it worse; that might be worth investigation. Another thing that would be v useful is to distil the example into a standalone test case. `vector` is a complicated library, but David's analysis above isolates the issue well. Maybe someone can turn that into a repro case? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.3 Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by carter): Simon,I think you mean Andreas, not David ;) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.3 Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by guibou): Simon: Is the following example without `vector` interesting as a standalone test case: {{{#!haskell import Data.Foldable (for_) main :: IO () main = do let n = 1000 for_ [0 :: Int ..(n - 1)] $ \_k -> do for_ [0 :: Int ..(n - 1)] $ \_i -> do for_ [0 :: Int ..(n - 1)] $ \_j -> do pure () }}} Timings: ghc 8.2.2: 0.667s ghc 8.4.4: 0.357s ghc 8.6.2: 1.007s Note that I have only included the results without using `-fllvm` because apparently the llvm backend is "smart" enough to understand that the loops can be removed and results in a runtime of 0.001s. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.3 Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Yes, that's great. I think we know why this is happening too. Can someone try the patch in #7206, specifically `wip/cheap-build-osa1`, and see if it nails this case? And perhaps do a new nofib run. The conclusion on #7206 was that we should probably accept the patch -- and if it cures this vector problem that'd be a strong positive reason to do so. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by maoe): * cc: maoe (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16004: Vector performance regression in GHC 8.6 -------------------------------------+------------------------------------- Reporter: guibou | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by nh2): * cc: nh2 (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16004#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC