Re: [GHC] #15333: Weird cachegrind results for binary-trees

2 Jul 2018

      #15333: Weird cachegrind results for binary-trees
-------------------------------------+-------------------------------------
        Reporter:  sgraf             |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  Research
       Component:  NoFib benchmark   |  needed
  suite                              |              Version:  8.5
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by sgraf:

Old description:
...
I'm currently investigating an alleged regression in my branch of the
late lambda lift and hit a confusing data point. Note that I'm very much
relying on cachegrinds counted instructions/memory accesses for my
findings.
Check out the most recent version of `nofib` and run the following
script:
{{{#!sh
#! /usr/bin/env zsh
sed -i 's/import Debug.Trace//g' Main.hs # Make the following line
idempotent
echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp
Main.hs # add the import for trace
sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in
the call to `bit`
ghc -O2 -XBangPatterns Main.hs -o bt1
sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call
to `bit`
ghc -O2 -XBangPatterns Main.hs -o bt2
valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace
valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace
}}}
This will compile two versions of `binary-trees`, one with an extra
`trace "" $` call before the only call to the `bit` function. One would
expect the version with the `trace` call (`bt1`) to allocate more than
the version without (`bt2`). Indeed, the output of `+RTS -s` suggests
that:
{{{
$ ./bt1 12 +RTS -s
...
43,107,560 bytes allocated in the heap
...
$ ./bt2 12 +RTS -s
...
43,116,888 bytes allocated in the heap
...
}}}
That's fine. A few benchmark runs by hand also suggested the tracing
version is a little slower (probably due to IO).
Compare that to the output of the above cachegrind calls:
{{{
$ valgrind --tool=cachegrind ./bt1 12 > /dev/null
...
I   refs:      118,697,583
...
D   refs:       43,475,212
...
$ valgrind --tool=cachegrind ./bt2 12 > /dev/null
...
I   refs:      116,340,710
...
D   refs:       42,523,369
...
}}}
It's the other way round here! How's that possible?
Even if this isn't strictly a bug in GHC or NoFib, it's relevant
nonetheless, as our benchmark infrastructure currently relies on
instruction counts. I couldn't reproduce this by writing my own no-op
`trace _ a = a; {-# NOINLINE trace #-}`, btw.
I checked this on GHC 8.2.2 and a semi-recent HEAD commit
(bb539cfe335eeec7989332b859b1f3162c5e105a).
New description:

 I'm currently investigating an alleged regression in my branch of the late
 lambda lift and hit a confusing data point. Note that I'm very much
 relying on cachegrinds counted instructions/memory accesses for my
 findings.

 Check out the most recent version of `nofib` and run the following script:

 {{{#!sh
 #! /bin/sh
 sed -i 's/import Debug.Trace//g' Main.hs # Make the following line
 idempotent
 echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp
 Main.hs # add the import for trace
 sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in
 the call to `bit`
 ghc -O2 -XBangPatterns Main.hs -o bt1
 sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call
 to `bit`
 ghc -O2 -XBangPatterns Main.hs -o bt2
 valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace
 valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace
 }}}

 This will compile two versions of `binary-trees`, one with an extra `trace
 "" $` call before the only call to the `bit` function. One would expect
 the version with the `trace` call (`bt1`) to allocate more than the
 version without (`bt2`). Indeed, the output of `+RTS -s` suggests that:

 {{{
 $ ./bt1 12 +RTS -s
 ...
 43,107,560 bytes allocated in the heap
 ...
 $ ./bt2 12 +RTS -s
 ...
 43,116,888 bytes allocated in the heap
 ...
 }}}

 That's fine. A few benchmark runs by hand also suggested the tracing
 version is a little slower (probably due to IO).

 Compare that to the output of the above cachegrind calls:

 {{{
 $ valgrind --tool=cachegrind ./bt1 12 > /dev/null
 ...
 I   refs:      118,697,583
 ...
 D   refs:       43,475,212
 ...
 $ valgrind --tool=cachegrind ./bt2 12 > /dev/null
 ...
 I   refs:      116,340,710
 ...
 D   refs:       42,523,369
 ...
 }}}

 It's the other way round here! How's that possible?

 Even if this isn't strictly a bug in GHC or NoFib, it's relevant
 nonetheless, as our benchmark infrastructure currently relies on
 instruction counts. I couldn't reproduce this by writing my own no-op
 `trace _ a = a; {-# NOINLINE trace #-}`, btw.

 I checked this on GHC 8.2.2 and a semi-recent HEAD commit
 (bb539cfe335eeec7989332b859b1f3162c5e105a).

--

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:1
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler