GHC code generation micro-optimisation patch

Hi, during the past semester I followed a seminar on the "Efficient implementation of functional languages" by Jeroen Fokker at the University Utrecht. During that course we worked on a feedback directed GHC optimisation, but that got me interested in another possible GHC backend micro-optimisation: The short story is this: An 8 line patch to GHC, tested with ghc 6.8.2 on nofib, ignoring all results with a < 0.5s runtime, yields an average runtime and compile time improvement of about 0.6%. The worst nofib slowdown is 5%, and the best speedup 8% Whether this is acceptable/enough for inclusion, is of course not up to me. The long story is that in compiler/codeGen/CgUtils.hs:mk_switch, an extra case is added to detect and treat specially the case analysis of 2 constructor datatypes.[1] Previously, case analysis of 2 constructor data types looked as follows in C--: _tmp = R1 & 3; if (_tmp >= 2) goto snd_con_lbl; With this patch, it generates the following: if (R1 & 2 != 0) goto snd_con_lbl; At least on x86 machines, the resulting assembly is more interesting. The original: movl %esi,%eax andl $3,%eax cmpl $2,%eax jae .snd_con_lbl After the patch: testl $2,%esi jne .snd_con_lbl The following table contains a summary of the nofib results: Machine Athlon32* Athlon64* Pentium4 Pentium4-dual PentiumD --------------------------------------------------------------------------------------- Vendor AMD AMD Intel Intel Intel Word size 32 64 32** 32** 32** OS Linux Linux Linux Linux Linux # CPU's 1 2 1 2 2 Mhz 950 2600 2800 2800 3000 L1 code/uops (Kb) 64 64 12 12 12 L1 data (Kb) 64 64 8 8 16 L2 (Kb) 256 1024 512 512 2048 -1 s.d. runtime (%) -1.4 -1.2 -2.4 -2.2 -5.1 +1 s.d. runtime (%) +1.1 +0.4 +1.5 +2.1 +1.3 Avg runtime (%) -0.2 -0.4 -0.4 -0.1 -1.9 Worst runtime (%) +2.5 +1.0 +3.1 +2.5 +5.0 Worst program cacheprof power simple wheel-sieve-1 integrate Best runtime (%) -5.1 -2.9 -5.0 -5.5 -8.1 Best program parstof hidden para life cryptarithm1 -1 s.d. mutator (%) -1.5 -1.6 -3.0 -3.0 -4.8 +1 s.d. mutator (%) +1.1 +0.8 +1.5 +2.2 +1.8 Avg mutator (%) -0.2 -0.4 -0.8 -0.5 -1.6 Avg bin size (%) -0.1 -0.2 -0.1 -0.1 -0.1 Avg mod size (%) -0.3 -0.5 -0.3 -0.3 -0.3 Avg comp time (%) -0.6 -0.7 -1.1 -0.6 -0.2 --------------------------------------------------------------------------------------- Cachegrind results -1 s.d. runtime (%) -2.0 -2.4 -3.2 -3.8 -3.3 +1 s.d. runtime (%) +0.5 +0.8 +1.2 +1.7 +0.6 Avg runtime (%) -0.7 -0.8 -1.0 -1.1 -1.3 -1 s.d. instr. (%) -4.2 -4.1 -4.3 -4.4 -4.4 +1 s.d. instr. (%) -0.6 -0.5 -0.6 -0.7 -0.7 Avg instr (%) -2.4 -2.3 -2.5 -2.5 -2.5 -1 s.d. cache misses (%) -0.6 -1.8 -0.4 -2.0 -2.8 +1 s.d. cache misses (%) +0.7 +2.2*** +0.7 +0.9 +1.1 Avg cache misses (%) +0.0 +0.2 +0.1 -0.6 -0.9 Avg comp time (%) -0.6 +0.3 -0.7 -0.4 -0.1 * On all Athlon results, but not on those of the Pentiums, nofib-analyse gave lots of "spurious result" warnings. On the Athlon32, nofib even failed with "output does not match" errors, which I could not verify when running diff on the output manually. I "fixed" the first problem by ignoring it and the second one by changing runstdtest to pretend the actual output always matches the expected output. ** CPU may actually be 64bits, but all software (including kernel) is 32bits *** cacheprof here has a whopping 21.0% more cache misses. The second biggest increase is 1.0% The attached patch is against 6.8.2, but currently applies without problem against the HEAD. The full nofib results can be found at http://www.students.cs.uu.nl/~rturk/fast2case-nofib-results/ The normal runs were done with NoFibRuns = 5, and for the cachegrind runs it was set to 2. On most machines I ran nofib multiple times and, both for normal and for "fast2case", took the fastest run. Thanks to Jeroen Fokker for giving the seminar that made me think of this in the first place, to Mark Stobbe and Eelco Lempsink for being great guys to work with during the seminar, and to Alexey Rodriguez for helping me keep my sanity in the midst of weird cache- and pipeline effects. And finally to the University Utrecht and Chris Regenboog for enabling me to run nofib on more computers than only my own poor old 950Mhz machine. Groeten, Remi [1] Technically, it will work for 4 constructor datatypes on 64 bit architectures too, but who cares.

Remi Turk wrote:
during the past semester I followed a seminar on the "Efficient implementation of functional languages" by Jeroen Fokker at the University Utrecht. During that course we worked on a feedback directed GHC optimisation, but that got me interested in another possible GHC backend micro-optimisation:
The short story is this:
An 8 line patch to GHC, tested with ghc 6.8.2 on nofib, ignoring all results with a < 0.5s runtime, yields an average runtime and compile time improvement of about 0.6%. The worst nofib slowdown is 5%, and the best speedup 8% Whether this is acceptable/enough for inclusion, is of course not up to me.
Hi Remi - thanks very much for the patch. It certainly looks worthwhile. I have some pending changes myself to mk_switch - it turns out that we're doing quite a bad job of compiling 3-way comparisons too (look at the code for fib sometime), so I'll try to incorporate your change into my refactorings. Cheers, Simon
participants (2)
-
Remi Turk
-
Simon Marlow