
#5793: make nofib awesome
-------------------------------------+-------------------------------------
        Reporter:  dterei            |                Owner:  michalt
            Type:  task              |               Status:  new
        Priority:  normal            |            Milestone:  ⊥
       Component:  NoFib benchmark   |              Version:
                   suite             |
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:  9571              |             Blocking:
 Related Tickets:  #15357 #15333     |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by sgraf):

Replying to [comment:16 simonmar]:
> The rationale for the naming is described in Will's paper
> [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.4124]. The
> distinction between `spectral` and `imaginary` is mainly size: the
> `imaginary` programs are microbenchmarks, whereas `spectral` programs are
> algorithmic kernels - larger than microbenchmarks but not real programs.
> I think it's a useful distinction.
>
> Perhaps we also want another distinction - programs that run for long
> enough that we can obtain reliable timing results. However that's an
> orthogonal dimension, and we can't sensibly use both dimensions for
> organising the directory structure. So I suggest we just have a flag per
> benchmark that is true if it runs for long enough, and have a simple way
> to run just those benchmarks.
I think flagging offending benchmarks is the last resort for getting
reliable measurements, but sadly I don't see any way around that for
''some'' benchmarks. Let me elaborate. I experienced fluctuations (random
improvements and regressions) for benchmarks with any of the following
traits:

 1. Runtime too short, so that we mostly measure system overhead (partly
    addressed by Phab:D4989, in response to #15357).
 2. Programs like `paraffins`, `binary-trees`, etc., which consist of a
    relatively long and allocation-intensive build-up phase, followed by a
    fold that immediately collapses the maximum residency. These kinds of
    benchmarks are very sensitive to changes in allocations. Also see
    #15333.
 3. Microbenchmarks like the `real/eff` suite, which optimise to a very
    small counting loop. Even when they run long enough for runtime
    measurements to be reproducible (e.g. `VSM` runs for about 670-700ms
    on my machine), a mostly irrelevant change in `base` leads to wibbles.

= System overhead =

Easy to fix, and Phab:D4989 did most (if not all?) of the work.

= GC =

The GC parameterisation is much harder to remove from the equation. There's
an example in #15333 and
https://ghc.haskell.org/trac/ghc/ticket/9476#comment:55 ff. The idea is the
following: varying allocations leads to different points in time at which
the nursery and the gen 2 heap run out of space, which in turn triggers GC
at different points in time. That is, imagine that because of an overall
reduction in allocations, the 5th major GC run happens ''immediately
before'' the entire structure is collapsed, whereas without the reduction
the 5th run would happen some time earlier (left: baseline, right: fewer
total allocations):

[[Image(https://i.imgur.com/C5224Ua.jpg, 300px)]]
[[Image(https://i.imgur.com/C05SC5K.jpg, 300px)]]

The residency at the 5th GC run would approach the maximum residency (not
the number sampled by the GC, but the one determined by program semantics),
which entails a maximally costly GC run, for an impact of as much as 10% on
total runtime. This is something that could be tuned by profile-guided
optimisation, but otherwise there is no way to predict when a program will
hit its maximum residency.

The above profile is from `paraffins`, which has the typical structure of
an allocating Haskell benchmark: build up and then consume a large data
structure. Although total allocations can be measured reliably for them,
these benchmarks aren't particularly indicative of runtime performance and
contribute to nofib's noise.

I can think of ways to fix the situation without flagging these benchmarks
away, but none that leave the benchmarks themselves untouched:

 1. Modify every benchmark so that at least one collection happens at the
    point where we think maximum residency is hit, through `performMajorGC`
    or some other kind of allocation in a loop. Yuck.
 2. Iterate every benchmark ''n'' times from within the same Haskell
    process. In particular, this won't reset the GC between runs and will
    lead to a better distribution of heap samples without actually changing
    the characteristics of the benchmark. We could shrink the input
    problems to make ''n'' sufficiently high while still running in
    acceptable time (e.g. 0.5-5s). (See the sketch at the end of this
    comment.)

TODO: Identify them.

= Microbenchmarks =

I blame non-determinism in code layout or some other caching effect until
I have a better hypothesis. These are the candidates we want to flag away
for runtime measurements, especially `real/eff`, or make them actually do
something meaningful.
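To make option 2 (combined with the forced collection from option 1) a bit
more concrete, here is a minimal sketch of what such an iterating harness
could look like. This is purely illustrative: `buildUp` and `consume` are
stand-ins for the actual benchmark kernel, the `deepseq` package is assumed
as a dependency, and the real benchmarks would keep their existing code.

{{{#!hs
module Main (main) where

import Control.DeepSeq (force)        -- assumes the deepseq package
import Control.Exception (evaluate)
import Control.Monad (forM_)
import System.Environment (getArgs)
import System.Mem (performMajorGC)

-- Stand-in for the allocation-heavy build-up phase
-- (e.g. generating the radicals in paraffins).
buildUp :: Int -> [Int]
buildUp n = [1 .. n]

-- Stand-in for the fold that collapses the structure again.
consume :: [Int] -> Int
consume = sum

main :: IO ()
main = do
  [sizeArg, itersArg] <- getArgs
  let size  = read sizeArg  :: Int    -- shrunk problem size
      iters = read itersArg :: Int    -- run n iterations in one process
  forM_ [1 .. iters] $ \i -> do
    -- Vary the input slightly per iteration so the built structure
    -- isn't floated out and shared between runs; force it fully so
    -- peak residency is actually reached here.
    xs <- evaluate (force (buildUp (size + i)))
    -- Option 1: guarantee that at least one major collection samples
    -- the heap near its peak.
    performMajorGC
    -- Option 2: run the consuming phase repeatedly without restarting
    -- the RTS, so GC state carries over between iterations.
    r <- evaluate (consume xs)
    print r
}}}

Whether varying the input per iteration like this is acceptable, or whether
every iteration should reuse exactly the same input, is something we'd have
to decide per benchmark.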
TODO: Identify them.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/5793#comment:38
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler