Potential for GHC benchmarks w.r.t. optimisations being incorrect

I am admittedly unsure of how GHC's optimisation benchmarks are currently implemented and carried out, but this paper and its findings could be relevant to GHC devs: http://cis.upenn.edu/~cis501/papers/producing-wrong-data.pdf

According to the paper, the cache effects of changing where the stack starts (which depends on the size of the environment variables) are huge for many compiler benchmarks, and adjusting for this bias shows that gcc -O3 is in reality only about 1% faster than gcc -O2.

Some further thoughts, per http://aftermath.rocks/2016/04/11/wrong_data/ :

"The question they looked at was the following: does the compiler’s -O3 optimization flag result in speedups over -O2? This question is investigated in the light of measurement biases caused by two sources: Unix environment size, and linking order. [The former refers] to the total size of the representation of Unix environment variables (such as PATH, HOME, etc.). Typically, these variables are part of the memory image of each process. The call stack begins where the environment ends. This gives rise to the following hypothesis: changing the sizes of (unused!) environment variables can change the alignment of variables on the stack and thus the performance of the program under test due to different behavior of hardware buffers such as caches or TLBs. (This is the source of the hypothetical example in the first paragraph, which I made up. On the machine where I am typing this, my user name appears in 12 of the environment variables that are set by default. All other things being equal, another user with a user name of a different length will have an environment size that differs by a multiple of 12 bytes.)"

"So does this hypothesis hold? Yes. Using a simple computational kernel the authors observe that changing the size of the environment can often cause a slowdown of 33% and, in one particular case, by 300%. On larger benchmarks the effects are less pronounced but still present. Using the C programs from the standard SPEC CPU2006 benchmark suite, the effects of -O2 and -O3 optimizations were compared across a wide range of environment sizes. For several of the programs a wide range of variations was observed, and the results often included both positive and negative observations. The effects were not correlated with the environment size. All this means that for some benchmarks, a compiler engineer might by accident test a purported optimization in a lucky environment and observe a 10% speedup, while users of the same optimization in an unlucky environment may have a 10% slowdown on the same workload."

I write this out of curiosity, as well as concern, over how this may affect GHC.
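For concreteness, the experiment the paper describes can be sketched in a few lines of Haskell: run the same benchmark binary while padding the environment with a dummy variable of varying size, and watch whether the measured time moves with it. This is only an illustration of the idea, not the paper's setup; the binary path ./mybench and the variable name BENCH_PADDING are placeholders.

    -- Crude probe for environment-size bias: run the same benchmark binary
    -- with a dummy environment variable of varying size, so the start of the
    -- stack shifts, and see whether the measured wall-clock time shifts too.
    import Control.Monad (forM_)
    import Data.Time.Clock (diffUTCTime, getCurrentTime)
    import System.Environment (getEnvironment)
    import System.Process (CreateProcess (..), proc, readCreateProcess)

    -- Run "./mybench" (placeholder) with padBytes extra bytes of environment.
    timeWithPadding :: Int -> IO Double
    timeWithPadding padBytes = do
      baseEnv <- getEnvironment
      let padded = ("BENCH_PADDING", replicate padBytes 'x') : baseEnv
      start <- getCurrentTime
      _ <- readCreateProcess (proc "./mybench" []) { env = Just padded } ""
      end <- getCurrentTime
      pure (realToFrac (end `diffUTCTime` start))

    main :: IO ()
    main = forM_ [0, 256 .. 4096] $ \pad -> do
      t <- timeWithPadding pad
      putStrLn (show pad ++ " bytes of padding: " ++ show t ++ " s")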

Hi,

On Saturday, 05.05.2018, at 12:33 -0400, Daniel Cartwright wrote:
I write this out of curiosity, as well as concern, over how this may affect GHC.
Our performance measurements are pretty non-scientific. For many decades, developers just ran our benchmark suite (nofib) before and after their change, hopefully on a cleanly built working copy, and pasted the most interesting numbers into the commit logs. Maybe some went for coffee to have an otherwise relatively quiet machine (or had some remote setup), maybe not.

In the end, the run-time performance numbers are often ignored and we focus on comparing the effects on *dynamic heap allocations*, which are much more stable across different environments, and which we believe are a good proxy for actual performance, at least for the kind of high-level optimisations that we work on in the core-to-core pipeline. But this assumption is folklore, and not scientifically investigated.

About two years ago we started collecting performance numbers for every commit to the GHC repository, and I wrote a tool to print comparisons: https://perf.haskell.org/ghc/

This runs on a dedicated physical machine, and still the run-time numbers varied too widely and gave us many false warnings (and probably reported many false improvements, which we of course were happy to believe). I have since switched to measuring only dynamic instruction counts with valgrind. This means that we cannot detect improvements or regressions due to certain low-level stuff, but we gain the ability to reliably measure *something* that we expect to change when we improve (or accidentally worsen) the high-level transformations.

I wish there were a better way of getting a reliable, stable number that reflects the actual performance.

Cheers,
Joachim

--
Joachim Breitner
mail@joachim-breitner.de
http://www.joachim-breitner.de/
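As a rough illustration of the instruction-counting idea (not the actual nofib or perf.haskell.org machinery), one can run a benchmark under valgrind's cachegrind tool and pick the total instruction count out of its summary. The sketch below assumes cachegrind's usual "I   refs:" summary line on stderr, and ./mybench is again a placeholder.

    -- Counting dynamic instructions with valgrind/cachegrind instead of
    -- measuring wall-clock time. Only a sketch of the idea, not what nofib
    -- does; it assumes cachegrind prints a summary line of the form
    -- "==PID== I   refs:  1,234,567" to stderr.
    import Data.Char (isDigit)
    import Data.List (isInfixOf)
    import System.Process (readProcessWithExitCode)

    instructionCount :: FilePath -> [String] -> IO (Maybe Integer)
    instructionCount bin args = do
      (_exitCode, _stdout, stderrText) <-
        readProcessWithExitCode "valgrind" ("--tool=cachegrind" : bin : args) ""
      -- Pick out the "I   refs:" line and keep only the digits after the colon.
      pure $ case filter ("I   refs:" `isInfixOf`) (lines stderrText) of
        (l : _) -> Just (read (filter isDigit (dropWhile (/= ':') l)))
        []      -> Nothing

    main :: IO ()
    main = instructionCount "./mybench" [] >>= print  -- "./mybench" is a placeholder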

Joachim Breitner wrote:
This runs on a dedicated physical machine, and still the run-time numbers varied too widely and gave us many false warnings (and probably reported many false improvements, which we of course were happy to believe). I have since switched to measuring only dynamic instruction counts with valgrind. This means that we cannot detect improvements or regressions due to certain low-level stuff, but we gain the ability to reliably measure *something* that we expect to change when we improve (or accidentally worsen) the high-level transformations.

While this matches my experience with the default settings, I had good results by tuning the number of measurements nofib does. With a high number of NoFibRuns (30+), disabling frequency scaling, stopping background tasks and walking away from the computer until it was done, I got the noise down to about +/-0.2% between subsequent runs. This doesn't eliminate alignment bias and the like, but at least it gives fairly reproducible results.
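A minimal sketch of that repeat-and-aggregate style of measurement, with ./mybench standing in for one nofib-style benchmark run:

    -- Repeat a timed run many times (in the spirit of NoFibRuns=30) and
    -- report the mean and the spread, to get a feel for the remaining noise.
    import Control.Monad (replicateM)
    import Data.Time.Clock (diffUTCTime, getCurrentTime)
    import System.Process (callCommand)

    -- One timed run; "./mybench" is a placeholder benchmark binary.
    runBenchmark :: IO Double
    runBenchmark = do
      start <- getCurrentTime
      callCommand "./mybench"
      end <- getCurrentTime
      pure (realToFrac (end `diffUTCTime` start))

    main :: IO ()
    main = do
      times <- replicateM 30 runBenchmark
      let mean   = sum times / fromIntegral (length times)
          spread = (maximum times - minimum times) / mean * 100
      putStrLn ("mean: " ++ show mean ++ " s, spread: +/-" ++ show (spread / 2) ++ "%")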
Sven Panne wrote:

4% is far from being "big". Look e.g. at https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues where changing just the alignment of the code led to a 10% difference; the code itself wasn't changed at all. :-/ The "Producing Wrong Data Without Doing Anything Obviously Wrong!" paper gives more funny examples.
I'm not saying that code layout has no impact, quite the opposite. The main point is: do we really have a benchmarking machinery in place which can tell you whether you've improved the real run time or made it worse? I doubt that, at least at the scale of a few percent. To reach just that simple yes/no conclusion, you would need quite heavy machinery involving randomized linking order, varying environments (in the sense of "number and contents of environment variables"), various CPU models, etc. If you do not do that, modern hardware will leave you with a lot of "WTF?!" moments and wrong conclusions.

You raise good points. While the example in the blog seems a bit constructed, with the whole loop fitting in a cache line, the principle is a real concern. I've hit alignment issues and WTF moments plenty of times in the past when looking at microbenchmarks.
However, on the scale of nofib I haven't really seen this happen so far. It's good to be aware of the chance of a whole suite giving wrong results, though.

I wonder if this effect is limited by GHC's tendency to use 8-byte alignment for all code (at least with tables-next-to-code)? If we only consider 16 bytes (the DSB buffer) and 32 bytes (cache lines) relevant, this reduces the possibilities by a lot after all.

In the particular example I've hit, however, it's pretty obvious that alignment is not the issue (and I still verified that). In the end, how big the impact of a better layout would be in general is hard to quantify. Hence the question whether anyone has pointers to good literature which looks into this.

Cheers,
Andreas
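A toy illustration of that counting argument (purely illustrative, not GHC code): with 8-byte-aligned code, a block can only start at a few distinct offsets within a decode window or a cache line (64 bytes, per the correction below), so far fewer alignment scenarios need to be considered.

    -- With 8-byte code alignment, enumerate the possible start offsets of a
    -- code block within a hardware buffer of the given size.
    possibleOffsets :: Int -> Int -> [Int]
    possibleOffsets alignment window = [0, alignment .. window - 1]

    main :: IO ()
    main = do
      putStrLn ("16-byte DSB window: " ++ show (possibleOffsets 8 16))
      putStrLn ("64-byte cache line: " ++ show (possibleOffsets 8 64))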

Hi,

On Sunday, 06.05.2018, at 16:41 +0200, Andreas Klebinger wrote:
With a high number of NoFibRuns (30+), disabling frequency scaling, stopping background tasks and walking away from the computer until it was done, I got the noise down to about +/-0.2% between subsequent runs. This doesn't eliminate alignment bias and the like, but at least it gives fairly reproducible results.
That’s true, but it leaves the alignment bias. This bit me in my work on Call Arity, as I write in my thesis:

Initially, I attempted to use the actual run time measurements, but it turned out to be a mostly pointless endeavour. For example the knights benchmark would become 9% slower when enabling Call Arity (i.e. when comparing (A) to (B)), a completely unexpected result, given that the changes to the GHC Core code were reasonable. Further investigation using performance data obtained from the CPU indicated that with the changed code, the CPU’s instruction decoder was idling for more cycles, hinting at cache effects and/or bad program layout. Indeed: When I compiled the code with the compiler flag -g, which includes debugging information in the resulting binary, but should otherwise not affect the relative performance characteristics much, the unexpected difference vanished. I conclude that non-local changes to the Haskell or Core code will change the layout of the generated program code in unpredictable ways and render such run time measurements mostly meaningless.

This conclusion has been drawn before [MDHS09], and recently, tools to mitigate this effect, e.g. by randomising the code layout [CB13], were created. Unfortunately, these currently target specific C compilers, so I could not use them here.

In the following measurements, I avoid this problem by not measuring program execution time, but simply by counting the number of instructions performed. This way, the variability in execution time due to code layout does not affect the results. To obtain the instruction counts I employ valgrind [NS07], which runs the benchmarks on a virtual CPU and thus produces more reliable and reproducible measurements.

Unpleasant experience.

Cheers,
Joachim

--
Joachim Breitner
mail@joachim-breitner.de
http://www.joachim-breitner.de/

2018-05-06 16:41 GMT+02:00 Andreas Klebinger:
[...] If we only consider 16 bytes (the DSB buffer) and 32 bytes (cache lines) relevant, this reduces the possibilities by a lot after all. [...]
Nitpick: Cache lines on basically all Intel/AMD processors contain 64 bytes; see e.g. http://www.agner.org/optimize/microarchitecture.pdf
participants (4):
- Andreas Klebinger
- Daniel Cartwright
- Joachim Breitner
- Sven Panne