
#15999: Stabilise nofib runtime measurements
-------------------------------------+-------------------------------------
        Reporter:  sgraf             |                Owner:  (none)
            Type:  task              |               Status:  new
        Priority:  normal            |            Milestone:  ⊥
       Component:  NoFib benchmark   |              Version:  8.6.2
  suite                              |
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #5793 #9476       |  Differential Rev(s):  Phab:D5438
  #15333 #15357                      |
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by sgraf):

Replying to [comment:2 osa1]:

> - I wonder if it'd be better to run the process multiple times, instead
>   of running the `main` function multiple times in the program. Why?
>   That way we know GHC won't fuse or somehow optimize the
>   `replicateM_ 100` call in the program, and we properly reset all the
>   resources/global state (both the program's and the runtime system's,
>   e.g. weak pointers, threads, stable names). It just seems more
>   reliable.
The whole point of this patch is that we iterate ''within'' the process, so that GC parameterisation doesn't affect performance (counted instructions, even) of the same program in 'unforeseen' ways, like the discontinuous, non-monotonic paraffins example. There we would expect that increasing the nursery always leads to higher productivity, because the GC has to run less often. That clearly isn't the case, due to effects outlined in https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38.

My hypothesis is that fixing the productivity curve above has the following effect on benchmarking a patch that changes allocations: basically, we fix the GC parameterisation and vary allocations, instead of fixing allocations and varying GC parameters. For a fixed GC parameterisation (which is always the case when running NoFib), we would expect an optimisation that reduces allocations by 10% to lead to less GC time, thus higher productivity. Yet, in https://ghc.haskell.org/trac/ghc/ticket/9476#comment:55, there is a regression in total allocations of 18.6%, while counted instructions improve by 11.7% (similar results when measuring actual runtime). As the thread reveals, I had to do quite some digging to find out why (https://ghc.haskell.org/trac/ghc/ticket/9476#comment:71): the program produces more garbage, leading to more favourable points at which GC is done. We don't want that to dilute our comparison!

This is also the reason why iterating the same program by restarting the whole process is impossible: GC parameterisation is deterministic (rightly so, IMO), so restarting the process would only measure the same dilution over and over. On the other hand, iterating the logic ''n'' times from within the program leads to different GC states at which the actual benchmark logic starts, and thus to a more uniform distribution of the points in the program at which GC happens.

For every benchmark in the patch, I made sure that `replicateM_ 100` is actually a sensible thing to do. Compare that to the current version of [https://github.com/ghc/nofib/blob/f87d446b4e361cc82f219cf78917db9681af69b3/s... awards], where that is not the case: GHC will float out the actual computation and all that is measured is `IO`. In paraffins I used `replicateM_ 100` (or rather `forM_ [1..100] $ const`, TODO), because the inner action has a call to `getArgs`, the result of which is used to build up the input data. GHC can't know that the result of `getArgs` doesn't change, so it can't memoise the benchmark result, and this measures what we want. In other cases I actually had to use `forM_` and make use of the counter somewhere in generating the input data.
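To make the difference concrete, here is a minimal, hypothetical sketch of the two iteration patterns; `expensive`, `badMain` and `goodMain` are made up for illustration and this is not the actual awards or paraffins source:

{{{#!haskell
module Main (main, badMain, goodMain) where

import Control.Monad (forM_, replicateM_)
import System.Environment (getArgs)

-- Stand-in for the benchmark kernel.
expensive :: Int -> Int
expensive n = sum [i * i | i <- [1 .. n]]

-- Anti-pattern (analogous to awards above): the work depends only on a
-- value computed once, so GHC is free to float `expensive n` out of the
-- loop and the 100 iterations measure little more than `print`/IO.
badMain :: IO ()
badMain = do
  [n] <- map read <$> getArgs
  replicateM_ 100 (print (expensive n))

-- Pattern the patch aims for: the input is rebuilt from `getArgs` inside
-- the iterated action (optionally mixing in the loop counter), so GHC
-- cannot share the result and every iteration performs the real work.
goodMain :: IO ()
goodMain =
  forM_ [1 .. 100 :: Int] $ \i -> do
    [n] <- map read <$> getArgs
    print (expensive (n + i))

main :: IO ()
main = goodMain
}}}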
- You say "GC wibbles", but I'm not sure if these are actually GC wibbles. I just checked paraffins: it doesn't do any IO (other than printing the results), and it's not even threaded (does not use threaded runtime, does not do `forkIO`). So I think it should be quite deterministic, and I think any wibbles are due to OS side of things. In other words, if we could have an OS that only runs `paraffins` and nothing else I think the results would be quite deterministic.
Of course this doesn't change the fact that we're getting non- deterministic results and we should do something about it, I'm just trying to understand the root cause here.
The numbers are deterministic, but they are off in the sense above. By GC wibble, I mean that varying tiny parameters of the program or the GC has a huge, non-monotonic, 'discontinuous' impact on the perceived performance, which makes the benchmark suite unsuited to evaluating any optimisation that changes allocations.

Replying to [comment:3 osa1]:
> I'm also wondering if having a fixed iteration number is a good idea.
> What if, in 5 years, 100 iterations for paraffins is not enough to get
> reliable numbers? I think `criterion` also has a solution for this (it
> does a different number of repetitions depending on results).
The point of iterating 100 times is not to adjust runtime so that we measure something other than system overhead (type 1 in https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38). It's solely to get smoother results in the above sense. There's still the regular fast/norm/slow mechanism, which is there to tune for specific runtimes, and those are quite likely to vary in the future.

Of course, we could just as well increment 100 to 243 for even smoother results instead of modifying the actual input problem. But having it fixed means that the curve for fast is as smooth as the one for slow, which is a good thing, I think. It means we can measure counted instructions on fast without having to worry too much about said dilutions.
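A hedged sketch of how the two knobs stay separate (the module and `run` are made up; the problem size is assumed to come from the mode-specific arguments in the benchmark's Makefile, as per the existing fast/norm/slow convention):

{{{#!haskell
module Main (main) where

import Control.Monad (forM_)
import System.Environment (getArgs)

-- Fixed across fast/norm/slow, so every mode gets the same smoothing
-- effect from in-process iteration.
iterations :: Int
iterations = 100

-- Stand-in for the actual benchmark kernel.
run :: Int -> Integer
run size = product [1 .. fromIntegral size]

main :: IO ()
main =
  forM_ [1 .. iterations] $ \_ -> do
    -- The problem size is the only knob that differs between modes; it
    -- is passed on the command line by the build system.
    [size] <- map read <$> getArgs
    print (run size)
}}}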
Replying to [comment:4 osa1]:

> Here's a library that can be useful for this purpose:
> http://hackage.haskell.org/package/bench
Ah, yes. That could indeed be worthwhile as a replacement for the current runner, but only if it supported measuring more than just time.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15999#comment:5>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler