
#15333: Weird cachegrind results for binary-trees -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research Component: NoFib benchmark | needed suite | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by sgraf: Old description:
I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings.
Check out the most recent version of `nofib` and run the following script:
{{{#!sh #! /usr/bin/env zsh sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt1 sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt2 valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace }}}
This will compile two versions of `binary-trees`, one with an extra `trace "" $` call before the only call to the `bit` function. One would expect the version with the `trace` call (`bt1`) to allocate more than the version without (`bt2`). Indeed, the output of `+RTS -s` suggests that:
{{{ $ ./bt1 12 +RTS -s ... 43,107,560 bytes allocated in the heap ... $ ./bt2 12 +RTS -s ... 43,116,888 bytes allocated in the heap ... }}}
That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO).
Compare that to the output of the above cachegrind calls:
{{{ $ valgrind --tool=cachegrind ./bt1 12 > /dev/null ... I refs: 118,697,583 ... D refs: 43,475,212 ... $ valgrind --tool=cachegrind ./bt2 12 > /dev/null ... I refs: 116,340,710 ... D refs: 42,523,369 ... }}}
It's the other way round here! How's that possible?
Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op `trace _ a = a; {-# NOINLINE trace #-}`, btw.
I checked this on GHC 8.2.2 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a).
New description: I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings. Check out the most recent version of `nofib` and run the following script: {{{#!sh #! /bin/sh sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt1 sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt2 valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace }}} This will compile two versions of `binary-trees`, one with an extra `trace "" $` call before the only call to the `bit` function. One would expect the version with the `trace` call (`bt1`) to allocate more than the version without (`bt2`). Indeed, the output of `+RTS -s` suggests that: {{{ $ ./bt1 12 +RTS -s ... 43,107,560 bytes allocated in the heap ... $ ./bt2 12 +RTS -s ... 43,116,888 bytes allocated in the heap ... }}} That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO). Compare that to the output of the above cachegrind calls: {{{ $ valgrind --tool=cachegrind ./bt1 12 > /dev/null ... I refs: 118,697,583 ... D refs: 43,475,212 ... $ valgrind --tool=cachegrind ./bt2 12 > /dev/null ... I refs: 116,340,710 ... D refs: 42,523,369 ... }}} It's the other way round here! How's that possible? Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op `trace _ a = a; {-# NOINLINE trace #-}`, btw. I checked this on GHC 8.2.2 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a). -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler