
#15999: Stabilise nofib runtime measurements
-------------------------------------+-------------------------------------
        Reporter:  sgraf             |                Owner:  (none)
            Type:  task              |               Status:  new
        Priority:  normal            |            Milestone:  ⊥
       Component:  NoFib benchmark   |              Version:  8.6.2
                   suite             |
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #5793 #9476       |  Differential Rev(s):  Phab:D5438
                   #15333 #15357     |
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Changes (by osa1):

 * cc: osa1 (added)
 * differential:  => Phab:D5438

Comment:

 Thanks for doing this! I think running the benchmarks multiple times is a
 good idea. That's what `criterion` does, and it provides quite reliable
 results, even for very fast programs.

 That said, looking at the patch and your `paraffins` example, I have some
 questions:

 - I wonder if it'd be better to run the process multiple times, instead
   of running the `main` function multiple times in the program. Why? That
   way we know GHC won't fuse or otherwise optimize the `replicateM_ 100`
   call in the program, and we properly reset all the resources and global
   state (both the program's and the runtime system's, e.g. weak pointers,
   threads, stable names). It just seems more reliable.

 - Of course this would make the analysis harder, as each run would print
   its own GC stats, which we would need to parse and somehow combine ...

 - That said, I wonder whether GC numbers are important for the purposes
   of nofib. In nofib we care about allocations and runtimes; as long as
   these numbers are stable it should be fine. So perhaps it's not too
   hard to repeat the process run instead of the `main` function.

 - You say "GC wibbles", but I'm not sure these are actually GC wibbles. I
   just checked `paraffins`: it doesn't do any IO (other than printing the
   results), and it's not even threaded (it does not use the threaded
   runtime and does not do `forkIO`).
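To illustrate the first point, repeating the whole process (rather than the `main` call) could be driven by a small Haskell harness. This is only a rough sketch of the idea, not part of nofib or the patch; the `benchmark` driver and its statistics are hypothetical, and a real harness would additionally have to parse and combine the per-run GC stats:

```haskell
import Control.Monad (replicateM)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Process (callProcess)

-- Time a single run of the benchmark as a fresh process, so GHC cannot
-- optimise across iterations and all RTS state (heap, weak pointers,
-- stable names, threads) is reset for every run.
timeRun :: FilePath -> [String] -> IO Double
timeRun exe args = do
  t0 <- getCurrentTime
  callProcess exe args
  t1 <- getCurrentTime
  pure (realToFrac (diffUTCTime t1 t0))

-- Hypothetical driver: run the process n times and report the mean
-- wall-clock time and its standard deviation.
benchmark :: Int -> FilePath -> [String] -> IO (Double, Double)
benchmark n exe args = do
  ts <- replicateM n (timeRun exe args)
  let mean = sum ts / fromIntegral n
      var  = sum [(t - mean) ^ (2 :: Int) | t <- ts] / fromIntegral n
  pure (mean, sqrt var)
```

The spread reported here is what would tell us whether the process-level repetition actually stabilises the measurements.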
 So I think it should be quite deterministic, and any wibbles are due to
 the OS side of things. In other words, if we had an OS that ran only
 `paraffins` and nothing else, I think the results would be quite
 deterministic. Of course this doesn't change the fact that we're getting
 non-deterministic results and should do something about it; I'm just
 trying to understand the root cause here.

 On my first point: if there is a solution for benchmarking "processes"
 (instead of "functions") using criterion-style iteration (by which I mean
 one that provides stable results), I think it may be worth trying. A few
 years back we used `hsbencher` for this purpose at IU, but IIRC it's a
 bit too heavy (lots of dependencies), and it seems unmaintained now. I
 vaguely recall another program for this purpose but I can't remember the
 name...

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15999#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler