[GHC] #14980: Runtime performance regression with binary operations on vectors

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Keywords: vector | Operating System: Unknown/Multiple bitwise operations | Architecture: | Type of failure: Runtime Unknown/Multiple | performance bug Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- Let me briefly explain our use case: we have a machine learning tools created in Haskell. Basically it builds association rules for a database. In plain English: predicates on data rows. We need to score them, so we need to check how many rows are "matched" by the predicate. In order to optimize performance, our code uses two main representations for data: one is a "human readable", where a values are real values. The second one is binarized representation for categorical data. The latter has is actually a family of representation, since we pack bits into tuples of Word64 and use bitwise operation to implement logic. Simplified but representative code is attached. In GHC 8.0.2 binary representation is faster by one order of magnitude than the "human readable" one: {{{ ➜ ghc-bug stack exec performance-bug-pair-1 "Generated" benchmarking 256 columns/raw unbox vectors time 342.6 μs (338.3 μs .. 348.0 μs) 0.999 R² (0.998 R² .. 1.000 R²) mean 339.3 μs (337.5 μs .. 342.0 μs) std dev 7.203 μs (5.596 μs .. 10.07 μs) variance introduced by outliers: 13% (moderately inflated) benchmarking 256 columns/binary packed time 32.07 μs (29.69 μs .. 34.14 μs) 0.982 R² (0.976 R² .. 0.997 R²) mean 29.97 μs (29.41 μs .. 31.03 μs) std dev 2.428 μs (1.526 μs .. 3.750 μs) variance introduced by outliers: 78% (severely inflated) }}} In GHC 8.2.2 (and later) binary representation performance is similar to "human readable", so no performance gain: {{{ ➜ ghc-bug stack exec performance-bug-pair-1 "Generated" benchmarking 256 columns/raw unbox vectors time 442.4 μs (406.7 μs .. 474.5 μs) 0.978 R² (0.969 R² .. 0.993 R²) mean 399.3 μs (391.3 μs .. 415.1 μs) std dev 34.73 μs (20.36 μs .. 53.29 μs) variance introduced by outliers: 71% (severely inflated) benchmarking 256 columns/binary packed time 378.6 μs (295.8 μs .. 488.0 μs) 0.637 R² (0.492 R² .. 0.780 R²) mean 568.1 μs (437.1 μs .. 747.6 μs) std dev 308.7 μs (233.6 μs .. 386.1 μs) variance introduced by outliers: 98% (severely inflated) }}} However, for certain compilation paths, we still can get similar speedup as with GHC 8.0.2. In the above examples we used 4-tuple of Word64 as binary representation. In the following code we run two tests: one on just Word64 and one of 4-tuple of Word64. The difference is that we just add the Word64 case: {{{ ➜ ghc-bug stack exec performance-bug-pair-2 "Generated" benchmarking 64 columns/raw unbox vectors time 337.6 μs (336.1 μs .. 339.3 μs) 0.999 R² (0.998 R² .. 1.000 R²) mean 349.6 μs (344.4 μs .. 359.7 μs) std dev 23.22 μs (15.39 μs .. 38.22 μs) variance introduced by outliers: 60% (severely inflated) benchmarking 64 columns/binary packed time 21.66 μs (21.53 μs .. 21.79 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 21.68 μs (21.53 μs .. 21.89 μs) std dev 613.2 ns (466.0 ns .. 806.0 ns) variance introduced by outliers: 30% (moderately inflated) benchmarking 256 columns/raw unbox vectors time 344.5 μs (341.6 μs .. 348.2 μs) 0.999 R² (0.999 R² .. 1.000 R²) mean 345.1 μs (342.5 μs .. 349.3 μs) std dev 10.66 μs (7.997 μs .. 16.34 μs) variance introduced by outliers: 25% (moderately inflated) benchmarking 256 columns/binary packed time 28.04 μs (27.70 μs .. 28.46 μs) 0.999 R² (0.999 R² .. 1.000 R²) mean 28.05 μs (27.85 μs .. 28.30 μs) std dev 758.2 ns (628.2 ns .. 907.6 ns) variance introduced by outliers: 27% (moderately inflated) }}} I made a variant of code with simpler accumulating function (in our code we accumulate pair of integers, simplified accumulator work on one integer). GHC 8.2.2 in that case losses speed-up with 4-tuples: {{{ ➜ ghc-bug stack exec performance-bug-2 "Generated" benchmarking 64 columns/raw unbox vectors time 333.8 μs (333.0 μs .. 335.1 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 333.6 μs (332.4 μs .. 335.8 μs) std dev 5.651 μs (3.233 μs .. 9.582 μs) benchmarking 64 columns/binary packed time 39.06 μs (38.98 μs .. 39.14 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 38.94 μs (38.83 μs .. 39.14 μs) std dev 495.0 ns (310.2 ns .. 782.1 ns) benchmarking 256 columns/raw unbox vectors time 336.9 μs (336.2 μs .. 337.9 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 337.5 μs (336.2 μs .. 339.8 μs) std dev 5.757 μs (2.946 μs .. 8.979 μs) benchmarking 256 columns/binary packed time 239.8 μs (237.6 μs .. 243.0 μs) 0.998 R² (0.996 R² .. 0.999 R²) mean 251.4 μs (247.9 μs .. 259.8 μs) std dev 11.50 μs (5.662 μs .. 19.79 μs) variance introduced by outliers: 37% (moderately inflated) }}} In GHC 8.0.2 we have speed-up in both cases. What may be related: using random-fu to generate random numbers seems to produce broken code on GHC 8.2.2 (the `performance-bug-rfu.hs` source). Short description of attached sources: - performance-bug-pair-2.hs: using two binary representations - performance-bug-pair-1.hs: using one binary representation (one case commented) - performance-bug-1.hs: using one binary representation with simplified accumulator - performance-bug-2.hs: using two binary representation with simplified accumulator - performance-bug-rfu.hs: using random-fu to generate data (optional) - stack-8.0.yaml: stack file for GHC-8.0.2 - stack-8.2.yaml: stack file for GHC-8.2.2 - stack-nightly.yaml: stack file for GHC-8.4 - performance-bug.cabal: to be able to stack build everything. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by ttylec): * Attachment "ghc-bug.tar.gz" added. source code -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by ttylec): As a suplement: it is related to optimization. I usually use -O2, but effect is present also with -O2. With -O0 on GHC 8.0.2 there is no speed- up in binarized version. Program compiled with GHC 8.2.2 with -O0 crashes(!) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: (none) Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * priority: normal => high Comment: Thank you for the very complete ticket and repro! We'll have a look. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: (none) Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by ttylec): It came to my mind that different stack snapshots could bring different `vector` version along GHC. But that's not the case, both have 0.12.0.1. Please let me know if I can be of any assistance. I tried to debug it more but I have no CS background (I am mathematical physicist) so you need to treat me as a newbie. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.6.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * owner: (none) => bgamari * milestone: => 8.6.1 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Hmm, I spent all day trying to reproduce this, but I'm getting exactly the results you'd want. I built the package with cabal, once using ghc 8.0.2, and once using 8.2.2 (both installed from linux-x86_64 release bundles downloaded directly from haskell.org), and I get the same behavior across both compilers, with performance metrics similar to what you're seeing in the 8.0.2 case. Output from `performance-bug-pair1`, compiled with GHC 8.2.2: {{{ tobias@zoidberg:~/well-typed/devel/ghc-T14980/ghc-bug/ > performance-bug- pair-1 "Generated" benchmarking 256 columns/raw unbox vectors time 478.6 μs (477.8 μs .. 479.9 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 476.4 μs (476.0 μs .. 477.0 μs) std dev 1.501 μs (1.076 μs .. 2.501 μs) benchmarking 256 columns/binary packed time 275.2 μs (271.7 μs .. 278.6 μs) 0.999 R² (0.998 R² .. 1.000 R²) mean 280.2 μs (276.6 μs .. 283.2 μs) std dev 6.706 μs (5.090 μs .. 9.040 μs) variance introduced by outliers: 13% (moderately inflated) }}} And with 8.0.2: {{{ "Generated" benchmarking 256 columns/raw unbox vectors time 476.9 μs (474.4 μs .. 480.5 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 473.5 μs (472.5 μs .. 475.1 μs) std dev 4.097 μs (2.212 μs .. 7.591 μs) benchmarking 256 columns/binary packed time 291.3 μs (286.8 μs .. 295.4 μs) 0.999 R² (0.998 R² .. 1.000 R²) mean 297.4 μs (293.6 μs .. 300.1 μs) std dev 6.918 μs (4.971 μs .. 9.873 μs) variance introduced by outliers: 12% (moderately inflated) }}} So there must be something about my setup that makes the bug disappear. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by ttylec): Hm... not exactly the bug disappear, you observed no speedup with "binary packed" version in the first place. Notice that in my benchmark, the "binary packed" version is an order of magnitude faster that the "unbox vectors" and the bug is about loosing that speed-up when we compile with 8.2.2 (and later) In your case, there was no speed-up in the first place. May I ask you to check also `stack exec performance-bug-pair-2` and `stack exec performance-bug-2`? I am curious on what machine/system you did tested it? Oh, and obviously optimization must be enabled (in case you didn't `stack build` it). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Hm... not exactly the bug disappear, you observed no speedup with "binary packed" version in the first place. Notice that in my benchmark,
In your case, there was no speed-up in the first place. May I ask you to check also `stack exec performance-bug-pair-2` and `stack exec
#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:7 ttylec]: the "binary packed" version is an order of magnitude faster that the "unbox vectors" and the bug is about loosing that speed-up when we compile with 8.2.2 (and later) Indeed, I noticed. performance-bug-2`? Stack failed on a cassava thing, I'm not exactly sure how to fix it, so I tried cabal first. If stack vs. cabal is the problem, however, then most likely that means it's a library problem rather than a GHC bug. I'll see if I can sort it out though, so that maybe I can reproduce your results. However, I don't think optimization settings could be the problem - after all, these are specified in the .cabal file, not stack.yaml, I have not overridden anything, and I verified that -O2 actually gets passed to GHC. I have even compiled the programs manually, merely pointing GHC to the cabal sandbox's package cache, and got the same results regardless. Results for `performance-bug-2` are similar. 8.0.2: {{{ "Generated" benchmarking 64 columns/raw unbox vectors time 445.2 μs (445.1 μs .. 445.4 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 443.4 μs (442.9 μs .. 443.8 μs) std dev 1.366 μs (1.077 μs .. 1.655 μs) benchmarking 64 columns/binary packed time 51.16 μs (51.09 μs .. 51.24 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.95 μs (50.90 μs .. 51.01 μs) std dev 204.4 ns (159.7 ns .. 264.0 ns) benchmarking 256 columns/raw unbox vectors time 443.9 μs (443.6 μs .. 444.4 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 442.7 μs (442.1 μs .. 444.1 μs) std dev 2.711 μs (1.414 μs .. 5.048 μs) benchmarking 256 columns/binary packed time 260.4 μs (255.3 μs .. 266.2 μs) 0.997 R² (0.996 R² .. 0.998 R²) mean 266.6 μs (263.1 μs .. 271.5 μs) std dev 9.366 μs (6.649 μs .. 13.29 μs) variance introduced by outliers: 24% (moderately inflated) }}} 8.2.2: {{{ "Generated" benchmarking 64 columns/raw unbox vectors time 445.0 μs (444.7 μs .. 445.2 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 444.0 μs (443.2 μs .. 447.0 μs) std dev 4.654 μs (1.118 μs .. 9.693 μs) benchmarking 64 columns/binary packed time 51.13 μs (51.11 μs .. 51.15 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.90 μs (50.86 μs .. 50.94 μs) std dev 146.1 ns (122.9 ns .. 181.2 ns) benchmarking 256 columns/raw unbox vectors time 440.4 μs (440.1 μs .. 440.5 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 437.9 μs (437.3 μs .. 438.4 μs) std dev 1.797 μs (1.576 μs .. 2.095 μs) benchmarking 256 columns/binary packed time 289.5 μs (285.1 μs .. 294.3 μs) 0.998 R² (0.998 R² .. 0.999 R²) mean 295.6 μs (292.0 μs .. 299.8 μs) std dev 8.814 μs (6.656 μs .. 11.69 μs) variance introduced by outliers: 19% (moderately inflated) }}}
I am curious on what machine/system you did tested it? Oh, and obviously optimization must be enabled (in case you didn't `stack build` it).
Debian 9, x86_64. Intel i5 CPU, 4 GB RAM, official GHC release builds. I have a few possible explanations as to why we're seeing these differences: - Stack may be pulling in other GHC versions than the release bundles - Stack may be pulling in different version of some crucial library - Whatever platform you run on might trigger different code paths in GHC I'll investigate further. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): OK, some more experimenting. Changed the .cabal file to force `vector` to `0.12.0.1` (the latest version on hackage); no difference, results are still more or less the same on both compilers. Also managed to get things built with stack; the 8.2 results are also very similar: {{{ "Generated" benchmarking 64 columns/raw unbox vectors time 443.6 μs (443.2 μs .. 443.9 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 441.6 μs (441.2 μs .. 442.0 μs) std dev 1.364 μs (1.092 μs .. 1.738 μs) benchmarking 64 columns/binary packed time 51.35 μs (51.29 μs .. 51.44 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 51.07 μs (51.00 μs .. 51.18 μs) std dev 265.9 ns (176.1 ns .. 453.5 ns) benchmarking 256 columns/raw unbox vectors time 446.1 μs (445.5 μs .. 447.2 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 443.9 μs (443.4 μs .. 444.6 μs) std dev 1.883 μs (1.337 μs .. 3.374 μs) benchmarking 256 columns/binary packed time 291.2 μs (286.2 μs .. 296.9 μs) 0.998 R² (0.998 R² .. 0.999 R²) mean 298.3 μs (294.2 μs .. 301.2 μs) std dev 7.891 μs (5.688 μs .. 10.20 μs) variance introduced by outliers: 15% (moderately inflated) }}} Stack-installed, with 8.0.2: {{{ "Generated" benchmarking 64 columns/raw unbox vectors time 412.3 μs (411.7 μs .. 413.1 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 412.4 μs (412.0 μs .. 413.1 μs) std dev 1.643 μs (1.124 μs .. 2.815 μs) benchmarking 64 columns/binary packed time 21.89 μs (21.87 μs .. 21.90 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 21.90 μs (21.89 μs .. 21.91 μs) std dev 38.13 ns (26.94 ns .. 54.15 ns) benchmarking 256 columns/raw unbox vectors time 417.6 μs (416.0 μs .. 419.5 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 417.2 μs (416.5 μs .. 419.1 μs) std dev 3.403 μs (935.8 ns .. 6.896 μs) benchmarking 256 columns/binary packed time 29.56 μs (29.51 μs .. 29.63 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 29.52 μs (29.51 μs .. 29.56 μs) std dev 71.00 ns (36.37 ns .. 131.9 ns) }}} So apparently compiling with 8.0.2 through cabal gives different results than compiling though stack. Strange. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Even more strangely, I am now getting 8.0.2-like results using 8.2.2: {{{ "Generated" benchmarking 64 columns/raw unbox vectors time 417.9 μs (417.5 μs .. 418.3 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 418.3 μs (417.8 μs .. 419.6 μs) std dev 2.364 μs (827.2 ns .. 4.744 μs) benchmarking 64 columns/binary packed time 22.95 μs (22.90 μs .. 23.01 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 22.94 μs (22.91 μs .. 22.97 μs) std dev 105.9 ns (75.87 ns .. 146.9 ns) benchmarking 256 columns/raw unbox vectors time 432.7 μs (422.4 μs .. 443.9 μs) 0.997 R² (0.995 R² .. 0.999 R²) mean 423.4 μs (420.3 μs .. 428.6 μs) std dev 13.41 μs (7.712 μs .. 19.14 μs) variance introduced by outliers: 25% (moderately inflated) benchmarking 256 columns/binary packed time 29.75 μs (29.51 μs .. 30.16 μs) 0.999 R² (0.999 R² .. 1.000 R²) mean 29.64 μs (29.57 μs .. 29.86 μs) std dev 395.1 ns (120.1 ns .. 801.2 ns) }}} I'm starting to suspect that the decisive factor here is not GHC itself, but rather something else. Quite possibly, library versions play a role - the cabal builds for 8.0.2 and 8.2.2 use the same versions for all libraries (except base), while the stack builds differ in a few of them. Another possible explanation could be that while I'm forcing the compiler itself when compiling with cabal (`--with-compiler` / `--with-ghc`), I am not doing the same for any of the other tools cabal may be using, and those might not resolve in a compiler-specific way but rather just use whatever is installed system-wide (which happens to be from GHC 8.0.2 right now, but was 8.4.2 when I first started profiling). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Perhaps the difference between stack and cabal come from the two making difference choices about which versions of which libraries they choose. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): I have written a script to analyze the library selections in the 4 different build scenarios. Cabal with GHC 8.0.2 and cabal with GHC 8.2.2 both pull in the exact same libraries (except for `base`, obv.); so I only compared cabal with GHC 8.0.2 (`good1`), stack LTS 9.21 (`good2`), and stack LTS 11.1 (`bad`). I deleted `~/.stack`, and made clean builds for each of the three and verified that the `good` ones expose the desired behavior, and the `bad` one regresses. Then I listed the packages that got installed (from `.stack/snapshots/...` and `.cabal-sandbox/.../...packages.conf.d`), and compared the three lists. I threw out all the packages that met any of the following criteria: - the package is not in all 3 lists (meaning that it's not a direct dependency, and the difference must also be reflected in some other package choice) - the package version for the `bad` list matches one of the `good` lists exactly (meaning that it cannot be the culprit) The remaining list is this (columns giving the installed versions for the `good1`, `good2` and `bad` lists in this order): {{{ aeson: 1.4.0.0 | 1.1.2.0 | 1.2.4.0 criterion: 1.4.1.0 | 1.1.4.0 | 1.3.0.0 primitive: 0.6.4.0 | 0.6.2.0 | 0.6.3.0 scientific: 0.3.6.2 | 0.3.5.2 | 0.3.5.3 terminal-ansi: 0.8.0.4 | 0.6.3.1 | 0.8.0.2 }}} Which is interesting, in that the `bad` versions are all between the `good1` and `good2` versions. So if any of these cause the badness, then the problem must be a regression that wasn't present in the first version, and has been fixed between the second and the third version. Another remarkable fact is that Cabal installs the exact same versions of all the libraries for GHC 8.0.2 and GHC 8.2.2, despite `base` versions being different - this could mean one of two things: a) Cabal fails to detect the correct version of `base` (using the system- wide GHC install instead of the one demanded by `--with-compiler`), so we end up resolving libraries for 8.0.2, and it just happens to compile cleanly with 8.2.2 anyway; or b) All the required libraries happen to have bounds on `base` that are compatible with both 8.0.2 and 8.2.2, so Cabal's resolver ends up producing the exact same solution for both compilers. a) would be a bug, b) would be a lucky coincidence. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): If the badness is only for a particular stack LTS, I wonder if the badness will go away in the next stack LTS and we can declare this ticket moot? I'd suggest the next thing to investigate: do a profiled build and see where the factor of 100 is going. It's such a huge change it should be blindingly obvious :-). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

= Data.Vector.Fusion.Util Data/Vector/Fusion/Util.hs:36:3-18 12.4 19.9 compile'.\ Main performance- bug-2.hs:40:22-33 11.4 4.3 basicUnsafeIndexM Data.Vector.Primitive Data/Vector/Primitive.hs:222:3-75 10.9 16.1 matchPacked Main performance- bug-2.hs:(108,1)-(111,27) 8.8 11.0 uniform System.Random.MWC System/Random/MWC.hs:217:5-33 8.7 15.0 basicUnsafeIndexM Data.Vector.Unboxed.Base Data/Vector/Unboxed/Base.hs:266:841-899 6.6 0.0 matchPacked.go Main performance- bug-2.hs:(110,5)-(111,27) 3.7 0.4 = Data.Vector.Fusion.Util Data/Vector/Fusion/Util.hs:50:3-19 3.3 2.6 getOverhead Criterion.Monad Criterion/Monad.hs:(47,1)-(56,12) 3.3 0.0 basicUnsafeWrite Data.Vector.Primitive.Mutable Data/Vector/Primitive/Mutable.hs:115:3-69 2.7 2.3 basicUnsafeIndexM Data.Vector.Unboxed.Base Data/Vector/Unboxed/Base.hs:345:3-73 2.6 11.2 matched.matches Main performance- bug-2.hs:28:5-27 2.4 0.0 basicUnsafeIndexM Data.Vector.Unboxed.Base internal/unbox-tuple- instances:(452,3)-(458,29) 2.2 0.0 compile' Main performance- bug-2.hs:40:1-33 1.7 0.0 compile.cc Main performance- bug-2.hs:37:5-28 1.6 0.0 getGCStats Criterion.Measurement Criterion/Measurement.hs:(46,1)-(48,16) 1.5 0.0 matched Main performance- bug-2.hs:(23,1)-(29,23) 1.5 2.4 basicLength Data.Vector.Unboxed.Base Data/Vector/Unboxed/Base.hs:343:3-42 1.1 0.0 fmap Data.Vector.Fusion.Stream.Monadic Data/Vector/Fusion/Stream/Monadic.hs:(133,3)-(135,20) 0.7 1.0 toPacked4.inst Main performance- bug-2.hs:88:5-26 0.6 1.1
= Data.Vector.Fusion.Util Data/Vector/Fusion/Util.hs:36:3-18 19.4 26.9 basicUnsafeIndexM Data.Vector.Primitive Data/Vector/Primitive.hs:222:3-75 14.8 18.3 compile'.\ Main performance- bug-2.hs:40:22-33 13.0 5.0 matchPacked Main performance- bug-2.hs:(108,1)-(111,27) 7.2 8.6 = Data.Vector.Fusion.Util Data/Vector/Fusion/Util.hs:50:3-19 4.2 3.0 basicUnsafeIndexM Data.Vector.Unboxed.Base Data/Vector/Unboxed/Base.hs:266:841-899 3.9 0.0 matchPacked.go Main performance- bug-2.hs:(110,5)-(111,27) 3.4 0.4
#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Profiled builds don't seem to reveal much, because as soon as I enable profiling, the performance difference between LTS 9 / GHC 8.0.2 and LTS 11 / GHC 8.2.2 disappears almost entirely - in fact, a profiling build from LTS 9 is *slower* than one from LTS 11. Profile output from LTS 9: {{{ Tue Jun 26 16:12 2018 Time and Allocation Profiling Report (Final) performance-bug-2 +RTS -p -RTS total time = 28.28 secs (28277 ticks @ 1000 us, 1 processor) total alloc = 24,003,734,896 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc pack64bit Main performance- bug-2.hs:(59,1)-(62,23) 0.6 1.8 }}} And LTS 11: {{{ Tue Jun 26 14:26 2018 Time and Allocation Profiling Report (Final) performance-bug-2 +RTS -p -RTS total time = 23.79 secs (23791 ticks @ 1000 us, 1 processor) total alloc = 17,329,594,624 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc primitive Control.Monad.Primitive Control/Monad/Primitive.hs:178:3-16 3.2 0.2 basicUnsafeIndexM Data.Vector.Unboxed.Base Data/Vector/Unboxed/Base.hs:345:3-73 2.9 13.2 matched.matches Main performance- bug-2.hs:28:5-27 2.7 0.0 uniform System.Random.MWC System/Random/MWC.hs:217:5-33 2.0 0.9 matched Main performance- bug-2.hs:(23,1)-(29,23) 2.0 2.8 basicUnsafeIndexM Data.Vector.Unboxed.Base internal/unbox-tuple- instances:(452,3)-(458,29) 1.8 0.0 compile' Main performance- bug-2.hs:40:1-33 1.7 0.0 basicUnsafeWrite Data.Vector.Primitive.Mutable Data/Vector/Primitive/Mutable.hs:115:3-69 1.5 4.2 compile.cc Main performance- bug-2.hs:37:5-28 1.5 0.0 basicLength Data.Vector.Unboxed.Base Data/Vector/Unboxed/Base.hs:343:3-42 1.2 0.0 toPacked4.inst Main performance- bug-2.hs:88:5-26 0.9 1.5 basicUnsafeRead Data.Vector.Primitive.Mutable Data/Vector/Primitive/Mutable.hs:112:3-63 0.7 1.7 fmap Data.Vector.Fusion.Stream.Monadic Data/Vector/Fusion/Stream/Monadic.hs:(133,3)-(135,20) 0.7 1.1 take64.(...) Main performance- bug-2.hs:66:17-38 0.6 2.3 }}} And the main culprits are roughly the same, give or take ordering. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:13 simonpj]:
If the badness is only for a particular stack LTS, I wonder if the badness will go away in the next stack LTS and we can declare this ticket moot?
That would of course be a "beautiful" solution. I could try pinning it down to a specific library, and force that to the newest version in the LTS 11.1 build, see if that makes any difference. Another thing to try would be to create a `cabal.config` with the exact library versions that stack uses; if that reproduces the badness, then it's the libraries, but if it doesn't, then the most likely explanation is that cabal does something wrong. Yet another thing would be to set up two VMs, each of them having only one GHC toolchain installed, one for 8.0.2 and one for 8.2.2, and see if *that* has us reproduce the observed behavior. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by ttylec): Indeed, sorry I didn't write earlier. But enabling profiling kills the performance boost. I suspect that the order-of-magnitude speedup is due to some really low level optimization. When the analogous code is run as a part of bigger code, the `unboxed vector` version scales across cores linearly up to 20 sth cores, while the `binary packed` does not scale linearly even with 2 cores. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by ttylec): Moreover, I try to reproduce my previous results (I had plan to switch to hs-gauge from criterion to lighten dependencies) and now I cannot reproduce results on my machine (?!). I consistently get order-of-magnitude speedup with ghc-8.0.2, ghc-8.2.2 (from lts-11.1 and lts-11.15) as well as ghc-8.4.3 (nightly snapshot). I try to figure out what changed... and the only difference I find is the fact that I installed newest ubuntu in mean time. And I have newer version of `stack` tool. This is totally confusing. GHC is installed by stack from binaries too... but I guess that packages tagged with version number don't change in time. BTW, times I get currently: {{{ (zettafox) ➜ ghc-bug stack exec performance-bug-pair-2 "Generated" benchmarking 64 columns/raw unbox vectors time 377.3 μs (372.6 μs .. 383.3 μs) 0.998 R² (0.996 R² .. 0.999 R²) mean 379.4 μs (375.5 μs .. 387.9 μs) std dev 19.02 μs (12.01 μs .. 30.94 μs) variance introduced by outliers: 46% (moderately inflated) benchmarking 64 columns/binary packed time 21.32 μs (21.20 μs .. 21.45 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 21.21 μs (21.10 μs .. 21.35 μs) std dev 426.0 ns (334.1 ns .. 567.0 ns) variance introduced by outliers: 18% (moderately inflated) benchmarking 256 columns/raw unbox vectors time 373.3 μs (366.7 μs .. 381.9 μs) 0.997 R² (0.994 R² .. 0.999 R²) mean 368.1 μs (365.6 μs .. 374.9 μs) std dev 12.72 μs (8.188 μs .. 22.90 μs) variance introduced by outliers: 29% (moderately inflated) benchmarking 256 columns/binary packed time 27.32 μs (27.14 μs .. 27.53 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 27.30 μs (27.16 μs .. 27.49 μs) std dev 535.0 ns (432.6 ns .. 704.7 ns) variance introduced by outliers: 17% (moderately inflated) }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): One thing to try would be to delete `~/.stack` and `~/.stack-work` between builds. This should give you a clean slate. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Speaking of clean slates; I just noticed that `$PATH` can mess up the testing, because stack defaults to installing binaries into `~/.local/bin/`, while cabal installs either into `~/.cabal/bin` (if no sandbox is used), or into `./.cabal-sandbox/bin` (if a sandbox is used). So you have to be careful to make sure you're running the correct binary, and just relying on `$PATH` might explain while switching between configurations or build tools might appear to leave some residue. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Another data point: I dumped stack's `cabal.config` for the LTS 11.1 release into my cabal project and compiled with `--with- compiler=/usr/local/bin/ghc-8.2.2`, and the result I'm getting is this: {{{ "Generated" benchmarking 64 columns/raw unbox vectors time 460.0 μs (447.0 μs .. 473.6 μs) 0.995 R² (0.992 R² .. 0.997 R²) mean 446.4 μs (440.2 μs .. 455.1 μs) std dev 24.42 μs (17.29 μs .. 31.22 μs) variance introduced by outliers: 48% (moderately inflated) benchmarking 64 columns/binary packed time 52.25 μs (51.66 μs .. 53.04 μs) 0.998 R² (0.997 R² .. 0.999 R²) mean 52.60 μs (51.99 μs .. 53.67 μs) std dev 2.665 μs (1.919 μs .. 4.073 μs) variance introduced by outliers: 55% (severely inflated) benchmarking 256 columns/raw unbox vectors time 439.9 μs (434.4 μs .. 447.4 μs) 0.998 R² (0.997 R² .. 1.000 R²) mean 439.0 μs (435.0 μs .. 446.4 μs) std dev 17.95 μs (10.79 μs .. 27.38 μs) variance introduced by outliers: 35% (moderately inflated) benchmarking 256 columns/binary packed time 304.7 μs (288.7 μs .. 330.4 μs) 0.965 R² (0.940 R² .. 0.998 R²) mean 324.1 μs (302.9 μs .. 364.4 μs) std dev 62.33 μs (26.19 μs .. 97.61 μs) variance introduced by outliers: 91% (severely inflated) }}} However, I'm getting the same bad behavior when I pin libraries to LTS 11.15, and also when I don't pin libraries at all. So it's not the libraries. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Performed an evil experiment: I put the `cabal.config` from LTS-11.1 into the project, then compiled with cabal, forcing the compiler to 8.0.2. Behavior is "good" (i.e., the 4th benchmark is in the 20-40 μs range. I also get the "good" behavior when compiling with 8.0.2, but without pinning. So it's definitely not the libraries. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by ttylec): Replying to [comment:20 tdammers]: This is not totally "bad" behavior. This:
{{{ "Generated" benchmarking 64 columns/raw unbox vectors time 460.0 μs (447.0 μs .. 473.6 μs) 0.995 R² (0.992 R² .. 0.997 R²) mean 446.4 μs (440.2 μs .. 455.1 μs) std dev 24.42 μs (17.29 μs .. 31.22 μs) variance introduced by outliers: 48% (moderately inflated)
benchmarking 64 columns/binary packed time 52.25 μs (51.66 μs .. 53.04 μs) 0.998 R² (0.997 R² .. 0.999 R²) mean 52.60 μs (51.99 μs .. 53.67 μs) std dev 2.665 μs (1.919 μs .. 4.073 μs) variance introduced by outliers: 55% (severely inflated) }}}
this is "good", we have significant speedup. But this:
{{{ benchmarking 256 columns/raw unbox vectors time 439.9 μs (434.4 μs .. 447.4 μs) 0.998 R² (0.997 R² .. 1.000 R²) mean 439.0 μs (435.0 μs .. 446.4 μs) std dev 17.95 μs (10.79 μs .. 27.38 μs) variance introduced by outliers: 35% (moderately inflated)
benchmarking 256 columns/binary packed time 304.7 μs (288.7 μs .. 330.4 μs) 0.965 R² (0.940 R² .. 0.998 R²) mean 324.1 μs (302.9 μs .. 364.4 μs) std dev 62.33 μs (26.19 μs .. 97.61 μs) variance introduced by outliers: 91% (severely inflated) }}}
is "bad". However, what I observed with the full code of our project, the speed-up is lost when we exceed the specific number of columns... but that number is platform specific (AMD performs worst, Intel is usually good, but then to MacBook Pro i5 CPU seems to be better than in i7 Lenovo on ubuntu). But since on the same platform you get different results for 256 columns, having speedup in with 64 columns, make me wonder, can the system kernel and/or libraries be affecting that? As for me not being able to reproduce my original report: I did try to remove `~/.stack` and `~/.stack-work`. Tried both with `split-obj` in stack config and without. I still can't get "bad" results on my current hardware/software. I am using only stack, I don't have system-wide GHC. I will try to do more tests on different machine (with debian 9); try the macbook at home too. But to sum up what we know until now: libraries are ruled out, compiler version seems to be ruled out too. What's left? GHC binary package, OS kernel, system libs? Does anything of that make sense? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): So, to summarize the results I have so far on my own machine (i5, debian): - The 64 columns version sees massive improvement throughout, on any compiler, and with any set of libraries; however, when the 256-columns version performs badly, the 64-columns version is slower by a factor of about 2 (~50 μs vs. ~25 μs). - Whether the improvement is seen or not does not seem to depend on the libraries - the same set of libraries, when forced with cabal, will sometimes show improvement, and sometimes not. - I have both 8.0.2 and 8.2.2 produce "good" and "bad" behavior; however, some of that may be due to accidentally running the wrong binary. The test runs in which I made absolutely sure I was running the correct binaries, and started with a clean slate, do show that 8.0.2 consistently produces "good" code, while 8.2.2 consistently produces "bad" code. So that leaves us with relatively obscure things. One thing to check would be which exact binaries stack pulls in. The GHC versions may match, but the build configuration might not. Kernel and system libs were all identical between my attempts, so I doubt those play a role. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:23 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): This smells to me like a crucial fusion rule not firing. You could try compiling with `-ddump-rule-firings` and see if the "good" and "bad" output differs. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Ah yes, great. I had already suspected fusion rules, but hadn't thought of `-ddump-rule-firings`. I'll give it a spin. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): OK, for a very rough first impression of `RULES` that fire in either case, I did this: {{{#!diff tobias@zoidberg:~/well-typed/devel/ghc-T14980/ > diff <(grep '^Rule' ghc- bug-8.0.2/rule-firings | sort -u) <(grep '^Rule' ghc-bug-8.2.2/rule- firings | sed -e 's/\s(.*)$//' | sort -u) -u | grep '^[+-]' --- /proc/self/fd/13 2018-06-28 12:26:37.103065551 +0200 +++ /proc/self/fd/14 2018-06-28 12:26:37.103065551 +0200 -Rule fired: >=# +Rule fired: ==# +Rule fired: ># -Rule fired: Class op < -Rule fired: Class op <*> -Rule fired: Class op >= -Rule fired: Class op fmap +Rule fired: Class op liftA2 -Rule fired: Class op $p1Applicative -Rule fired: eftIntList +Rule fired: divInt# +Rule fired: foldr2/left -Rule fired: narrow32Word# -Rule fired: narrow8Word# -Rule fired: or# -Rule fired: SC:foldlM_loop0 -Rule fired: SC:foldlM_loop1 +Rule fired: SC:go0 -Rule fired: SC:$j1 +Rule fired: SC:$wfoldlM_loop0 -Rule fired: seq of cast +Rule fired: stream/unstream [Vector] +Rule fired: uncheckedShiftL# }}} That is, I compiled the `performance-bug-2` program with both compilers using `-ddump-rule-firings`, and then filtered the dump as follows: 1. Retain only the actual rule firings 2. Sort by rule name 3. Remove duplicates 4. Strip the extra information that is only present in 8.2.2 output 5. Diff 6. Retain only + and - lines. `+` lines are rules that only fire on 8.2.2, `-` lines fire only on 8.0.2. I don't know which of these makes the decisive difference, but I suspect it's one of the built-in rules. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): While I've not looked at the program, I would guess `eftIntList`. That allows us to deforest a consumed list produced by `enumFromTo` -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14980: Runtime performance regression with binary operations on vectors -------------------------------------+------------------------------------- Reporter: ttylec | Owner: bgamari Type: bug | Status: new Priority: high | Milestone: 8.8.1 Component: Compiler | Version: 8.2.2 Resolution: | Keywords: vector | bitwise operations Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by ttylec): Sorry for long inactivity. I had some time to do more tests recently, based on your suggestions. To focus attention, I tried to investigate difference between: - `performance-bug-pair-1`: 256 columns - `performance-bug-pair-2`: 64 and 256, the same code as in `performance- bug-pair-1` but with 64 column check uncommented On ghc-8.2.2 and ghc-8.0.2. - on ghc-8.2.2, `performance-bug-pair-2` behaves as expected -- we have speedup for both 64 and 256 columns, `performance-bug-pair-1` does not have speed-up. - on ghc-8.0.2, I see speedup in both cases. Based on @tdammers example, I tried to print fired rules difference between both `performance-bug-pair-1` and `performance-bug-pair-2`, using analogous diff expression: {{{ ➜ ghc-bug diff <(grep '^Rule' rules-firing1-1 | sort -u) <(grep '^Rule' rules-firing-2 | sort -u) -u | grep '^[+-]' --- /proc/self/fd/13 2018-10-23 15:49:27.063541325 +0200 +++ /proc/self/fd/14 2018-10-23 15:49:27.067541342 +0200 -Rule fired: eftIntList (GHC.Enum) -Rule fired: stream/unstream [Vector] (Data.Vector.Generic) +Rule fired: SPEC matchPacked @ BPack4 (Main) +Rule fired: SPEC matchPacked @ BPack (Main) }}} `-` are rules that fired for `performance-bug-pair-1` (the bad) and `+` are rules that fired for `performance-bug-pair-2`. It's not about specialization, because I even tried to compile with `performance-bug-pair-1` with explicit monomorphic type. The only common rule on both rules, that fired in bad case and did not fire in good is `Rule fired: stream/unstream [Vector] (Data.Vector.Generic)`. Not that I understand it... but it seems that firing that rule interferes with some GHC optimizations? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14980#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC