Cache miss performance costs for Haskell programs?

Hi,

Any Haskell profiling and performance tuning blog or tutorial will advise the use of memory space and runtime profiling with GHC tooling. Far less is said about the impact of increased cache miss rates as program size grows.

The paper "Secrets of the Glasgow Haskell Compiler inliner" (Journal of Functional Programming, July 2002) talks a lot about the benefits of inlining: it is a central part of GHC's simplifier because it enables many other optimisations, some of which ultimately reduce program size. Much less is said about the detrimental effect that bad inlining choices have on runtime. Section 2.2 of the paper says: "Bloated programs are bad (increased compilation time, lower cache hit rates)".

I'd really like to see how badly Haskell runtimes are affected as cache hit rates decrease. Is anyone aware of any empirical studies, papers, or blog posts that show examples where, for the same Haskell program, increasing inlining causes lower cache hit rates, which slows down runtime because fetches from main memory cost many more cycles?

Thanks,

--
Rob
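As a concrete sketch of the experiment being asked for (the flags below are GHC's real inliner knobs, but the threshold value and the toy program are purely illustrative), one could compile the same module twice with different inlining aggressiveness and compare hardware counters for the two binaries:

```haskell
module Main (main) where

-- Build A (default inlining):
--   ghc -O2 Main.hs -o fib-default
-- Build B (much more aggressive inlining; 1000 is an arbitrary large value):
--   ghc -O2 -funfolding-use-threshold=1000 \
--           -funfolding-creation-threshold=1000 Main.hs -o fib-inlined
-- Then compare, per binary:
--   perf stat -e cache-references,cache-misses ./fib-default
--   perf stat -e cache-references,cache-misses ./fib-inlined

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

main :: IO ()
main = print (fib 30)
```

A toy fib will barely exercise the instruction cache; a realistic subject would be a larger nofib-style benchmark, but the compare-two-builds methodology is the same.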

Hi Rob,

On Tuesday, 30.08.2016, 16:23 +0100, Rob Stewart wrote:
I'd really like to see how badly Haskell runtimes are affected as cache hit rates decrease. Is anyone aware of any empirical studies, or papers, or blog posts, that show examples where:
For the same Haskell program, increasing inlining causes lower cache hit rates, which slows down runtime due to costly cycles to fetch from main memory more often.
Not quite inlining, but I have had an experience like this when evaluating the benefit of the Call Arity program analysis. Here is the relevant excerpt from my thesis (§3.5.3, you can find the thesis at http://www.joachim-breitner.de/thesis/):

Initially, I attempted to use the actual run time measurements, but it turned out to be a mostly pointless endeavour. For example, the knights benchmark would become 9% slower when enabling Call Arity (i.e. when comparing (A) to (B)), a completely unexpected result, given that the changes to the GHC Core code were reasonable. Further investigation using performance data obtained from the CPU indicated that with the changed code, the CPU’s instruction decoder was idling for more cycles, hinting at cache effects and/or bad program layout.

Indeed: When I compiled the code with the compiler flag -g, which includes debugging information in the resulting binary but should otherwise not affect the relative performance characteristics much, the unexpected difference vanished. I conclude that non-local changes to the Haskell or Core code will change the layout of the generated program code in unpredictable ways and render such run time measurements mostly meaningless.

This conclusion has been drawn before [MDHS09], and recently, tools to mitigate this effect, e.g. by randomising the code layout [CB13], were created. Unfortunately, these currently target specific C compilers, so I could not use them here.

In the following measurements, I avoid this problem by not measuring program execution time, but simply by counting the number of instructions performed. This way, the variability in execution time due to code layout does not affect the results. To obtain the instruction counts I employ valgrind [NS07], which runs the benchmarks on a virtual CPU and thus produces more reliable and reproducible measurements.
Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • https://www.joachim-breitner.de/ XMPP: nomeata@joachim-breitner.de • OpenPGP-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

Hi Joachim, Thanks for the reply, and really interesting work!
In the following measurements, I avoid this problem by not measuring program execution time, but simply by counting the number of instructions performed.
Are these a count of all instructions performed at runtime? Did you isolate a count of just the memory access instructions that ended up fetching from main memory? Also, did you measure the average clock cycle latency of memory access instructions for (B), (C) and (D), to get an indication of cache misses?
I've just measured this program with perf:
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
main = print $ fib 39
I.e.

perf stat -e task-clock,cycles,instructions,cache-references,cache-misses fib-exe

63245986

 Performance counter stats for 'fib-exe':

       5989.527848 task-clock (msec)        #  1.268 CPUs utilized
    18,825,545,684 cycles                   #  3.143 GHz
    42,833,528,405 instructions             #  2.28  insns per cycle
        70,496,109 cache-references         # 11.770 M/sec
           161,186 cache-misses             #  0.229 % of all cache refs

       4.725101260 seconds time elapsed
For three runs, the number of cache misses were: 161186 (4.7 seconds),
80802 (4.4 seconds), and 102681 (4.5 seconds).
I'm interested in hearing from Haskell developers who've used Valgrind or perf and started caring about minimising executable size, e.g. by injecting NOINLINE pragmas or removing INLINE pragmas.
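For concreteness, here is a hedged sketch of the kind of pragma tweak meant here (the function names and bodies are made up purely for illustration):

```haskell
module Main (main) where

-- Illustrative only: a largish function whose code GHC might otherwise
-- duplicate at every call site, growing the instruction stream.
expensive :: Int -> Int
expensive x = sum [x * k + (x - k) ^ (2 :: Int) | k <- [1 .. 64]]
{-# NOINLINE expensive #-}  -- keep a single shared copy of this code

-- A tiny wrapper that we do want the simplifier to inline away:
double :: Int -> Int
double n = expensive (2 * n)
{-# INLINE double #-}

main :: IO ()
main = print (sum (map double [1 .. 1000]))
```

Whether the NOINLINE helps or hurts is exactly the empirical question: it keeps the binary smaller but blocks the follow-on simplifications that inlining would enable.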
--
Rob

Keep in mind that the OS interrupts a process about 1000 times per second and completely invalidates the cache that many times. Greets. On 08/30/2016 05:57 PM, Rob Stewart wrote:

2016-08-30 17:57 GMT+02:00 Rob Stewart
[...] I'm interested in hearing about Haskell developers who've used Valgrind, or perf, to start caring about ensuring minimising executable size, by injecting NOINLINE pragmas or removing INLINE pragmas.
Note that the size of an executable and memory cache hit rates are only some of the many things heavily influencing performance. You can easily get a slowdown of an order of magnitude if your program e.g. interacts badly with branch prediction, or if it has pseudo-dependencies between instructions due to partial register writes. One can work around these issues on a low level:

* If the branch predictor has no clue, backward jumps are normally assumed to be taken (loops!) and forward jumps are assumed to be not taken. So your code generator should better lay out the common case in a straight line.

* Use the right dependency-breaking instructions when needed, see e.g. section 3.5.1.8 "Clearing Registers and Dependency Breaking Idioms" in http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimizat... .

* If you ever wondered why e.g. the Intel processors have various seemingly identical instructions, the answer is: Some registers are internally "typed", and you pay a relatively high cost if you access them in an "untyped" manner.

If LLVM is used as the backend, this should be handled automatically, at least if we give the right hints to it. In a nutshell: Size matters only sometimes. ;-) If you really care about performance in detail, you have to look at lots of perf values, e.g. pipeline stalls, branch mispredictions, etc.

Hi Rob,

On Tuesday, 30.08.2016, 16:57 +0100, Rob Stewart wrote:
Thanks for the reply, and really interesting work!
In the following measurements, I avoid this problem by not measuring program execution time, but simply by counting the number of instructions performed.
Are these a count of all instructions performed at runtime? Did you isolate a count of just the memory access instructions that ended up fetching from main memory? Also, did you measure the clock cycle latency averages for memory access instructions for (B) (C) and (D), to get an indication of cache misses?
I did not dig deeper. I just ran the test suite under valgrind (nofib has support for that) and took the number of instructions that came out.

My working hypothesis was that the effects of high-level (i.e. Core) transformations on code layout, branch prediction, pipelining, and cache hits and misses are too hard to predict, and that such measurements would not give me useful information. But you seem to be after a more optimistic and principled approach here, so I’ll be keen to hear what you find out.

Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • https://www.joachim-breitner.de/ XMPP: nomeata@joachim-breitner.de • OpenPGP-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

On Tue, Aug 30, 2016, at 08:53 PM, Rob Stewart wrote:
Any Haskell profiling and performance tuning blog or tutorial will advise the use of memory space and runtime profiling, using GHC tooling. Far less is said about the impact of increased cache miss rates as program size increases.
The paper "Secrets of the Glasgow Haskell Compiler inliner", in the Journal of Functional Programming, July 2002, talks a lot about the benefits of inlining, i.e. it's part of GHCs simplifier as it enables many other optimisations, some that ultimately reduce program size.
Not much is said about detrimental effect that bad inlining choices has to runtime. The paper says: "Bloated programs are bad (increased compilation time, lower cache hit rates)" in Section 2.2.
I'd really like to see how badly Haskell runtimes are affected as cache hit rates decrease. Is anyone aware of any empirical studies, or papers, or blog posts, that show examples where:
For the same Haskell program, increasing inlining causes lower cache hit rates, which slows down runtime due to costly cycles to fetch from main memory more often.
Hi Rob,

Have you looked at the `perf' tool supported in recent Linux kernel versions? It seems to have tools to report cache statistics:

http://developers.redhat.com/blog/2014/03/10/determining-whether-an-applicat...

I haven't used it on Haskell programs though.

-- Ramakrishnan

Accidentally didn't address the mailing list: Additionally, if you want to investigate things like cache misses, etc. Intel VTune Amplifier is an amazing profiling tool and there is a non-commercial open source license available for it. Cheers, Merijn
participants (6)

- Branimir Maksimovic
- Joachim Breitner
- Merijn Verstraaten
- Ramakrishnan Muthukrishnan
- Rob Stewart
- Sven Panne