[GHC] #15333: Weird cachegrind results for binary-trees

#15333: Weird cachegrind results for binary-trees -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research needed Component: NoFib | Version: 8.5 benchmark suite | Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: None/Unknown Unknown/Multiple | Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings. Check out the most recent version of `nofib` and run the following script: {{{#!sh #! /usr/bin/env zsh sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt1 sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt2 valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace }}} This will compile two versions of `binary-trees`, one with an extra `trace "" $` call before the only call to the `bit` function. One would expect the version with the `trace` call (`bt1`) to allocate more than the version without (`bt2`). Indeed, the output of `+RTS -s` suggests that: {{{ $ ./bt1 12 +RTS -s ... 43,107,560 bytes allocated in the heap ... $ ./bt2 12 +RTS -s ... 43,116,888 bytes allocated in the heap ... }}} That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO). Compare that to the output of the above cachegrind calls: {{{ $ valgrind --tool=cachegrind ./bt1 12 > /dev/null ... I refs: 118,697,583 ... D refs: 43,475,212 ... $ valgrind --tool=cachegrind ./bt2 12 > /dev/null ... I refs: 116,340,710 ... D refs: 42,523,369 ... }}} It's the other way round here! How's that possible? Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op `trace _ a = a; {-# NOINLINE trace #-}`, btw. I checked this on GHC 8.2.2 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15333: Weird cachegrind results for binary-trees -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research Component: NoFib benchmark | needed suite | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by sgraf: Old description:
I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings.
Check out the most recent version of `nofib` and run the following script:
{{{#!sh #! /usr/bin/env zsh sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt1 sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt2 valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace }}}
This will compile two versions of `binary-trees`, one with an extra `trace "" $` call before the only call to the `bit` function. One would expect the version with the `trace` call (`bt1`) to allocate more than the version without (`bt2`). Indeed, the output of `+RTS -s` suggests that:
{{{ $ ./bt1 12 +RTS -s ... 43,107,560 bytes allocated in the heap ... $ ./bt2 12 +RTS -s ... 43,116,888 bytes allocated in the heap ... }}}
That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO).
Compare that to the output of the above cachegrind calls:
{{{ $ valgrind --tool=cachegrind ./bt1 12 > /dev/null ... I refs: 118,697,583 ... D refs: 43,475,212 ... $ valgrind --tool=cachegrind ./bt2 12 > /dev/null ... I refs: 116,340,710 ... D refs: 42,523,369 ... }}}
It's the other way round here! How's that possible?
Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op `trace _ a = a; {-# NOINLINE trace #-}`, btw.
I checked this on GHC 8.2.2 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a).
New description: I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings. Check out the most recent version of `nofib` and run the following script: {{{#!sh #! /bin/sh sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt1 sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt2 valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace }}} This will compile two versions of `binary-trees`, one with an extra `trace "" $` call before the only call to the `bit` function. One would expect the version with the `trace` call (`bt1`) to allocate more than the version without (`bt2`). Indeed, the output of `+RTS -s` suggests that: {{{ $ ./bt1 12 +RTS -s ... 43,107,560 bytes allocated in the heap ... $ ./bt2 12 +RTS -s ... 43,116,888 bytes allocated in the heap ... }}} That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO). Compare that to the output of the above cachegrind calls: {{{ $ valgrind --tool=cachegrind ./bt1 12 > /dev/null ... I refs: 118,697,583 ... D refs: 43,475,212 ... $ valgrind --tool=cachegrind ./bt2 12 > /dev/null ... I refs: 116,340,710 ... D refs: 42,523,369 ... }}} It's the other way round here! How's that possible? Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op `trace _ a = a; {-# NOINLINE trace #-}`, btw. I checked this on GHC 8.2.2 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a). -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15333: Weird cachegrind results for binary-trees -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research Component: NoFib benchmark | needed suite | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): I'm still not really sure what's going on, but I feel like alignment instructions might skew cachegrinds instruction counts according to [http://valgrind.org/docs/manual/cg-manual.html section 5.2.10 of the valgrind manual]. I've had a few other cases in the meantime. Passing `-fllvm -optlo -Os` to use the LLVM backend optimising for code size helps. Is there some equivalent for the NCG? That would be far more reliable for the counted instruction metric that our benchmark CI relies on. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15333: Weird cachegrind results for binary-trees -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research Component: NoFib benchmark | needed suite | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by sgraf: Old description:
I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings.
Check out the most recent version of `nofib` and run the following script:
{{{#!sh #! /bin/sh sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt1 sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt2 valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace }}}
This will compile two versions of `binary-trees`, one with an extra `trace "" $` call before the only call to the `bit` function. One would expect the version with the `trace` call (`bt1`) to allocate more than the version without (`bt2`). Indeed, the output of `+RTS -s` suggests that:
{{{ $ ./bt1 12 +RTS -s ... 43,107,560 bytes allocated in the heap ... $ ./bt2 12 +RTS -s ... 43,116,888 bytes allocated in the heap ... }}}
That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO).
Compare that to the output of the above cachegrind calls:
{{{ $ valgrind --tool=cachegrind ./bt1 12 > /dev/null ... I refs: 118,697,583 ... D refs: 43,475,212 ... $ valgrind --tool=cachegrind ./bt2 12 > /dev/null ... I refs: 116,340,710 ... D refs: 42,523,369 ... }}}
It's the other way round here! How's that possible?
Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op `trace _ a = a; {-# NOINLINE trace #-}`, btw.
I checked this on GHC 8.2.2 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a).
New description: I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings. Check out the most recent version of `nofib` and run the following script: {{{#!sh #! /bin/sh sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace # bt1: Vanilly sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt1 # bt2: Additional trace call sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt2 valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace }}} This will compile two versions of `binary-trees`, the original, unchanged version and one with an extra `trace "" $` call before the only call to the `bit` function. One would expect the version with the `trace` call (`bt2`) to allocate more than the version without (`bt1`). Indeed, the output of `+RTS -s` suggests that: {{{ $ ./bt1 12 +RTS -s ... 43,107,560 bytes allocated in the heap ... $ ./bt2 12 +RTS -s ... 43,116,888 bytes allocated in the heap ... }}} That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO). Compare that to the output of the above cachegrind calls: {{{ $ valgrind --tool=cachegrind ./bt1 12 > /dev/null ... I refs: 118,697,583 ... D refs: 43,475,212 ... $ valgrind --tool=cachegrind ./bt2 12 > /dev/null ... I refs: 116,340,710 ... D refs: 42,523,369 ... }}} It's the other way round here! How's that possible? Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op `trace _ a = a; {-# NOINLINE trace #-}`, btw. I checked this on GHC 8.2.2, 8.4.3 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a). -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15333: Weird cachegrind results for binary-trees -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research Component: NoFib benchmark | needed suite | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Passing `-fllvm -optlo -Os` doesn't seem to remedy the symptoms of the original bug. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15333: Weird cachegrind results for binary-trees -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research Component: NoFib benchmark | needed suite | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by sgraf: Old description:
I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings.
Check out the most recent version of `nofib` and run the following script:
{{{#!sh #! /bin/sh sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace
# bt1: Vanilly sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt1
# bt2: Additional trace call sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt2
valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace }}}
This will compile two versions of `binary-trees`, the original, unchanged version and one with an extra `trace "" $` call before the only call to the `bit` function. One would expect the version with the `trace` call (`bt2`) to allocate more than the version without (`bt1`). Indeed, the output of `+RTS -s` suggests that:
{{{ $ ./bt1 12 +RTS -s ... 43,107,560 bytes allocated in the heap ... $ ./bt2 12 +RTS -s ... 43,116,888 bytes allocated in the heap ... }}}
That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO).
Compare that to the output of the above cachegrind calls:
{{{ $ valgrind --tool=cachegrind ./bt1 12 > /dev/null ... I refs: 118,697,583 ... D refs: 43,475,212 ... $ valgrind --tool=cachegrind ./bt2 12 > /dev/null ... I refs: 116,340,710 ... D refs: 42,523,369 ... }}}
It's the other way round here! How's that possible?
Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op `trace _ a = a; {-# NOINLINE trace #-}`, btw.
I checked this on GHC 8.2.2, 8.4.3 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a).
New description: I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings. Check out the most recent version of `nofib`, enter `shootout/binary- trees` and run the following script: {{{#!sh #! /bin/sh sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace # bt1: Vanilly sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt1 # bt2: Additional trace call sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit` ghc -O2 -XBangPatterns Main.hs -o bt2 valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace }}} This will compile two versions of `binary-trees`, the original, unchanged version and one with an extra `trace "" $` call before the only call to the `bit` function. One would expect the version with the `trace` call (`bt2`) to allocate more than the version without (`bt1`). Indeed, the output of `+RTS -s` suggests that: {{{ $ ./bt1 12 +RTS -s ... 43,107,560 bytes allocated in the heap ... $ ./bt2 12 +RTS -s ... 43,116,888 bytes allocated in the heap ... }}} That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO). Compare that to the output of the above cachegrind calls: {{{ $ valgrind --tool=cachegrind ./bt1 12 > /dev/null ... I refs: 118,697,583 ... D refs: 43,475,212 ... $ valgrind --tool=cachegrind ./bt2 12 > /dev/null ... I refs: 116,340,710 ... D refs: 42,523,369 ... }}} It's the other way round here! How's that possible? Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op `trace _ a = a; {-# NOINLINE trace #-}`, btw. I checked this on GHC 8.2.2, 8.4.3 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a). -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15333: Weird cachegrind results for binary-trees -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research Component: NoFib benchmark | needed suite | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): I think I figured this out. This is due to nursery size, specifically the points in time after which the GC kicks in. For benchmarks with a working set of varying size, GC Gen 0 collection performance heavily depends on the size of the working set within the nursery ''at the time collection is triggered''. There obviously is no way to predict this reliably. In this case, the default nursery size seemed to be particularly unfortunate, as `+RTS -S` reveals: {{{ Alloc Copied Live GC GC TOT TOT Page Flts bytes bytes bytes user elap user elap 1097344 490864 548288 0.001 0.001 0.002 0.002 0 0 (Gen: 0) 1044752 645640 706848 0.001 0.001 0.003 0.003 0 0 (Gen: 0) 1046272 592608 650032 0.001 0.001 0.004 0.004 0 0 (Gen: 1) 1079040 17808 699952 0.000 0.000 0.005 0.005 0 0 (Gen: 0) 1046272 17808 717104 0.000 0.000 0.006 0.006 0 0 (Gen: 0) 1046272 17808 734256 0.000 0.000 0.006 0.006 0 0 (Gen: 0) 1079040 17808 784176 0.000 0.000 0.007 0.007 0 0 (Gen: 0) 1046272 17808 801328 0.000 0.000 0.007 0.007 0 0 (Gen: 0) 1078160 111840 912488 0.000 0.000 0.007 0.007 0 0 (Gen: 0) 1036312 96056 911464 0.000 0.000 0.008 0.008 0 0 (Gen: 0) 1036288 6608 915560 0.000 0.000 0.008 0.008 0 0 (Gen: 0) 1036288 6608 919656 0.000 0.000 0.009 0.009 0 0 (Gen: 0) 1036288 6608 923752 0.000 0.000 0.009 0.009 0 0 (Gen: 0) 1036288 6608 927848 0.000 0.000 0.009 0.009 0 0 (Gen: 0) 1036288 6608 931944 0.000 0.000 0.010 0.010 0 0 (Gen: 0) 1036288 6608 936040 0.000 0.000 0.010 0.010 0 0 (Gen: 0) 1036288 6608 940136 0.000 0.000 0.011 0.011 0 0 (Gen: 0) 1044920 29264 966864 0.000 0.000 0.011 0.011 0 0 (Gen: 0) 1045504 25496 963536 0.000 0.000 0.011 0.011 0 0 (Gen: 0) 1045504 6160 964560 0.000 0.000 0.012 0.012 0 0 (Gen: 0) 1045504 6160 965584 0.000 0.000 0.012 0.012 0 0 (Gen: 0) 1045504 6160 966608 0.000 0.000 0.013 0.013 0 0 (Gen: 0) 1045504 6160 967632 0.000 0.000 0.013 0.013 0 0 (Gen: 0) 1045504 6160 968656 0.000 0.000 0.014 0.014 0 0 (Gen: 0) 1045504 6160 969680 0.000 0.000 0.014 0.014 0 0 (Gen: 0) 1047464 59208 1023728 0.000 0.000 0.015 0.015 0 0 (Gen: 0) 1047808 57968 1022896 0.000 0.000 0.015 0.015 0 0 (Gen: 0) 1047808 52912 1023120 0.000 0.000 0.016 0.016 0 0 (Gen: 0) 1047808 52944 1023376 0.000 0.000 0.016 0.016 0 0 (Gen: 0) 1047808 52944 1023632 0.000 0.000 0.017 0.017 0 0 (Gen: 0) 1047808 52944 1023888 0.000 0.000 0.017 0.017 0 0 (Gen: 0) 1047808 52944 1024144 0.000 0.000 0.018 0.018 0 0 (Gen: 0) 1047808 52944 1024400 0.000 0.000 0.018 0.018 0 0 (Gen: 0) 1048296 176376 1148064 0.000 0.000 0.018 0.018 0 0 (Gen: 0) 1048384 175792 1147856 0.000 0.000 0.019 0.019 0 0 (Gen: 0) 1048384 174288 1147856 0.000 0.000 0.019 0.019 0 0 (Gen: 0) 1048384 174288 1147856 0.000 0.000 0.020 0.020 0 0 (Gen: 0) 1048384 174288 1147856 0.000 0.000 0.020 0.020 0 0 (Gen: 0) 1048384 174288 1147856 0.000 0.000 0.021 0.021 0 0 (Gen: 0) 1048384 174288 1147856 0.000 0.000 0.021 0.021 0 0 (Gen: 0) 1048384 174288 1147856 0.000 0.000 0.022 0.022 0 0 (Gen: 0) 116160 3488 44576 0.000 0.000 0.022 0.022 0 0 (Gen: 1) 0 0.000 0.000 }}} This is the log for `bt1` (e.g. without the call to `trace`). Note the higher numbers in the `Copied bytes` column compared to `bt2` (with the redundant `trace` call), in particular the whole second half: {{{ Alloc Copied Live GC GC TOT TOT Page Flts bytes bytes bytes user elap user elap 1097448 488888 546312 0.001 0.001 0.002 0.002 0 0 (Gen: 0) 1044664 645424 706840 0.001 0.001 0.004 0.004 0 0 (Gen: 0) 1046272 591480 648904 0.001 0.001 0.005 0.005 0 0 (Gen: 1) 1079040 17488 698824 0.000 0.000 0.006 0.006 0 0 (Gen: 0) 1046272 17488 715976 0.000 0.000 0.007 0.007 0 0 (Gen: 0) 1046272 17488 733128 0.000 0.000 0.008 0.008 0 0 (Gen: 0) 1079040 17488 783048 0.000 0.000 0.008 0.008 0 0 (Gen: 0) 1046272 17488 800200 0.000 0.000 0.009 0.009 0 0 (Gen: 0) 1078376 108336 908176 0.000 0.000 0.010 0.010 0 0 (Gen: 0) 1036312 96600 911232 0.000 0.000 0.011 0.011 0 0 (Gen: 0) 1036288 7088 915248 0.000 0.000 0.012 0.012 0 0 (Gen: 0) 1036288 7120 919344 0.000 0.000 0.012 0.012 0 0 (Gen: 0) 1036288 7120 923440 0.000 0.000 0.013 0.013 0 0 (Gen: 0) 1036288 7120 927536 0.000 0.000 0.014 0.014 0 0 (Gen: 0) 1036288 7120 931632 0.000 0.000 0.014 0.014 0 0 (Gen: 0) 1036288 7120 935728 0.000 0.000 0.015 0.015 0 0 (Gen: 0) 1036288 7120 939824 0.000 0.000 0.015 0.015 0 0 (Gen: 0) 1045264 24608 961384 0.000 0.000 0.015 0.015 0 0 (Gen: 0) 1045504 20824 958056 0.000 0.000 0.016 0.016 0 0 (Gen: 0) 1045504 1488 959080 0.000 0.000 0.016 0.016 0 0 (Gen: 0) 1045504 1488 960104 0.000 0.000 0.016 0.016 0 0 (Gen: 0) 1045504 1488 961128 0.000 0.000 0.017 0.017 0 0 (Gen: 0) 1045504 1488 962152 0.000 0.000 0.017 0.017 0 0 (Gen: 0) 1045504 1488 963176 0.000 0.000 0.017 0.017 0 0 (Gen: 0) 1045504 1488 964200 0.000 0.000 0.018 0.018 0 0 (Gen: 0) 1047600 53192 1016904 0.000 0.000 0.018 0.018 0 0 (Gen: 0) 1047808 51952 1016072 0.000 0.000 0.019 0.019 0 0 (Gen: 0) 1047808 46896 1016296 0.000 0.000 0.019 0.019 0 0 (Gen: 0) 1047808 46928 1016552 0.000 0.000 0.019 0.019 0 0 (Gen: 0) 1047808 46928 1016808 0.000 0.000 0.020 0.020 0 0 (Gen: 0) 1047808 46928 1017064 0.000 0.000 0.020 0.020 0 0 (Gen: 0) 1047808 46928 1017320 0.000 0.000 0.020 0.020 0 0 (Gen: 0) 1047808 46928 1017576 0.000 0.000 0.021 0.021 0 0 (Gen: 0) 1048176 168760 1139640 0.000 0.000 0.021 0.021 0 0 (Gen: 0) 1048384 168176 1139432 0.000 0.000 0.022 0.022 0 0 (Gen: 0) 1048384 166672 1139432 0.000 0.000 0.022 0.022 0 0 (Gen: 0) 1048384 166672 1139432 0.000 0.000 0.023 0.023 0 0 (Gen: 0) 1048384 166672 1139432 0.000 0.000 0.023 0.023 0 0 (Gen: 0) 1048384 166672 1139432 0.000 0.000 0.023 0.023 0 0 (Gen: 0) 1048384 166672 1139432 0.000 0.000 0.024 0.024 0 0 (Gen: 0) 1048384 166672 1139432 0.000 0.000 0.024 0.024 0 0 (Gen: 0) 123776 2504 43592 0.000 0.000 0.024 0.025 0 0 (Gen: 1) 0 0.000 0.000 }}} I see no easy way around this other than 'deactivating' the GC by choosing a big enough nursery, such as `-A11M` in this case. Taking the GC out of the equation in this way may be somewhat misrepresentative, but at least we get cachegrind numbers that can be trusted. There are other cases I hit this, like `bspt`. I improved allocation a bit, triggering GC at later points when the working set was bigger. Result was an alleged regression of 6% in counted instructions, even though the relevant optimized snippet in isolation got faster. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15333: Nursery size adulterates cachegrind metrics in nofib -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research Component: NoFib benchmark | needed suite | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15333: Nursery size adulterates cachegrind metrics in nofib -------------------------------------+------------------------------------- Reporter: sgraf | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Research Component: NoFib benchmark | needed suite | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #5793 #15999 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by sgraf): * related: => #5793 #15999 Comment: See my recent concerns about the non-monotone impact of allocations on productivity. Here's a productivity curve: [[Image(https://i.imgur.com/kiZwF0R.png)]] It's not as bad as with paraffins in #15999, but not really monotone either. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15333#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC