[GHC] #15999: Stabilise nofib runtime measurements

#15999: Stabilise nofib runtime measurements
-------------------------------------+-------------------------------------
           Reporter:  sgraf          |                Owner:  (none)
               Type:  task           |               Status:  new
           Priority:  normal         |            Milestone:  ⊥
          Component:  NoFib          |              Version:  8.6.2
  benchmark suite                    |
           Keywords:                 |     Operating System:
                                     |  Unknown/Multiple
       Architecture:                 |      Type of failure:  None/Unknown
  Unknown/Multiple                   |
          Test Case:                 |           Blocked By:
           Blocking:                 |      Related Tickets:  #5793 #9476
                                     |  #15333 #15357
Differential Rev(s):                 |            Wiki Page:
-------------------------------------+-------------------------------------
 With Phab:D4989 (cf. #15357) having hit `nofib` master, there are still
 many benchmarks that are unstable. I identified three causes of
 instability in https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38.
 With system overhead mostly out of the equation, there are still two
 related tasks left:

 1. Identify benchmarks with GC wibbles. Plan: look at counted
    instructions while varying heap size with just one generation. A
    wibbling benchmark should show quite diverse sampled maximum residency
    (as opposed to a microbenchmark, which should have a quite stable
    instruction count). Then fix these by iterating `main` 'often enough'.
    Maybe look at total bytes allocated for that; we want it to decline
    monotonically as the initial heap size grows.
 2. Now all benchmarks should have a stable instruction count. If not,
    maybe there's another class of benchmarks I didn't identify yet in
    #5793. Of these benchmarks, there are a few, like `real/eff/CS`, that
    still have highly unstable runtimes. Fix these 'microbenchmarks' by
    hiding them behind a flag.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999
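As a rough sketch of what "iterating `main` often enough" looks like in a benchmark (the `oneRun` body here is a hypothetical placeholder, not an actual nofib benchmark):

{{{#!hs
import Control.Monad (replicateM_)
import System.Environment (getArgs)

-- One iteration of the benchmark logic. Reading the problem size from
-- the command line keeps GHC from caching the result across iterations.
oneRun :: IO ()
oneRun = do
  [n] <- getArgs
  print (length (show (product [1 .. read n :: Integer])))

main :: IO ()
main = replicateM_ 100 oneRun
}}}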

#15999: Stabilise nofib runtime measurements

Comment (by sgraf):

 I figured out a more useful metric for identifying GC wibbles: watching
 the productivity rate change while increasing the nursery. Here's the
 current state of affairs for `paraffins`:

 [[Image(https://i.imgur.com/XxdNJQ8.png)]]

 While there are some wibbles due to the short running time, it's clear
 that the function is highly discontinuous. We want something more like
 this:

 [[Image(https://i.imgur.com/fA4S38S.png)]]

 That's a much smoother curve (which of course is still subject to
 measurement inaccuracy). I got this just by iterating `main` 100 times.
 Obviously, this has implications for the running time of the benchmark,
 so the `*MODE`s have to be adjusted accordingly.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999#comment:1

#15999: Stabilise nofib runtime measurements

Changes (by sgraf):

 * Attachment "prod.py" added: Python script for sampling and plotting
   productivity rates.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999
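The attached script itself is not reproduced in this archive. As an indication of what sampling productivity over nursery sizes involves, here is a minimal Haskell sketch of the same idea (the `sampleProductivity` helper and the hard-coded sizes are assumptions of this sketch, not taken from prod.py; it assumes the benchmark binary was linked with `-rtsopts`):

{{{#!hs
import System.Process (readProcessWithExitCode)

-- Run the benchmark once with the given nursery size (in KiB) and
-- compute productivity = mutator time / total time. With
-- "+RTS -t --machine-readable", the RTS prints its summary to stderr
-- as a Haskell-readable association list.
sampleProductivity :: FilePath -> Int -> IO Double
sampleProductivity bin nurseryKiB = do
  (_code, _out, err) <- readProcessWithExitCode bin
    ["+RTS", "-t", "--machine-readable",
     "-A" ++ show nurseryKiB ++ "k", "-RTS"] ""
  let stats = read (dropWhile (/= '[') err) :: [(String, String)]
      field k = maybe (error ("missing " ++ k)) read (lookup k stats)
  pure (field "mutator_wall_seconds" / field "total_wall_seconds")

main :: IO ()
main = mapM_ (\kib -> sampleProductivity "./paraffins" kib >>= print)
             [256, 512, 1024, 2048, 4096, 8192]
}}}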

#15999: Stabilise nofib runtime measurements

Changes (by osa1):

 * cc: osa1 (added)
 * differential:  => Phab:D5438

Comment:

 Thanks for doing this! I think running the benchmarks multiple times is a
 good idea. That's what `criterion` does, and it provides quite reliable
 results, even for very fast programs.

 That said, looking at the patch and your `paraffins` example, I have some
 questions:

 - I wonder if it'd be better to run the process multiple times, instead
   of running the `main` function multiple times in the program. Why? That
   way we know GHC won't fuse or somehow optimize the `replicateM_ 100`
   call in the program, and we properly reset all the resources/global
   state (both the program's and the runtime system's, e.g. weak pointers,
   threads, stable names). It just seems more reliable.

 - Of course this would make the analysis harder, as each run will print
   GC stats which we need to parse and somehow combine ...

 - That said, I wonder if GC numbers are important for the purposes of
   nofib. In nofib we care about allocations and runtimes; as long as
   those numbers are stable it should be fine. So perhaps it's not too
   hard to repeat the process run instead of the `main` function.

 - You say "GC wibbles", but I'm not sure these are actually GC wibbles. I
   just checked paraffins: it doesn't do any IO (other than printing the
   results), and it's not even threaded (does not use the threaded
   runtime, does not do `forkIO`). So I think it should be quite
   deterministic, and I think any wibbles are due to the OS side of
   things. In other words, if we could have an OS that only runs
   `paraffins` and nothing else, I think the results would be quite
   deterministic. Of course this doesn't change the fact that we're
   getting non-deterministic results and we should do something about it;
   I'm just trying to understand the root cause here.

 On my first point: if a solution exists for benchmarking "processes"
 (instead of "functions") with criterion-style iteration (by which I mean
 "provides stable results"), I think it may be worth trying. A few years
 back we used `hsbencher` for this purpose at IU, but IIRC it's a bit too
 heavy (lots of dependencies), and it seems unmaintained now. I vaguely
 recall another program for this purpose but I can't remember the name...

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999#comment:2

#15999: Stabilise nofib runtime measurements

Comment (by osa1):

 I'm also wondering whether having a fixed iteration count is a good idea.
 What if, in 5 years, 100 iterations for paraffins is not enough to get
 reliable numbers? I think `criterion` also has a solution for this (it
 adjusts the number of repetitions depending on the results).

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999#comment:3
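For reference, the criterion-style answer looks roughly like the following sketch (the general idea only, not criterion's actual algorithm): keep doubling the iteration count until one timed batch runs long enough to dominate timer resolution and constant overheads.

{{{#!hs
import Control.Monad (replicateM_)
import System.CPUTime (getCPUTime)

-- Double the iteration count until one batch takes at least 0.1 s of
-- CPU time; measure with that count thereafter.
findIterationCount :: IO () -> IO Int
findIterationCount act = go 1
  where
    threshold :: Integer
    threshold = 10 ^ (11 :: Int)   -- 0.1 s in picoseconds
    go n = do
      before <- getCPUTime
      replicateM_ n act
      after  <- getCPUTime
      if after - before >= threshold then pure n else go (2 * n)
}}}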

#15999: Stabilise nofib runtime measurements

Comment (by osa1):

 Here's a library that could be useful for this purpose:
 http://hackage.haskell.org/package/bench

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999#comment:4

#15999: Stabilise nofib runtime measurements

Comment (by sgraf):

 Replying to [comment:2 osa1]:
 > - I wonder if it'd be better to run the process multiple times, instead
 > of running the `main` function multiple times in the program. Why? That
 > way we know GHC won't fuse or somehow optimize the `replicateM_ 100`
 > call in the program, and we properly reset all the resources/global
 > state (both the program's and the runtime system's, e.g. weak pointers,
 > threads, stable names). It just seems more reliable.

 The whole point of this patch is that we iterate ''within'' the process,
 so that GC parameterisation doesn't affect performance (counted
 instructions, even) of the same program in 'unforeseen' ways, like the
 discontinuous, non-monotonic paraffins example. There we would expect
 that increasing the nursery always leads to higher productivity, because
 the GC has to run less often. That clearly isn't the case, due to effects
 outlined in https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38.

 My hypothesis is that fixing the productivity curve above has the
 following effect on benchmarking a patch that changes allocations:
 basically, we fix the GC parameterisation and vary allocations, instead
 of fixing allocations and varying GC parameters. For a fixed GC
 parameterisation (which is always the case when running NoFib), we would
 expect an optimisation that reduces allocations by 10% to lead to less GC
 time, and thus higher productivity. Yet, in
 https://ghc.haskell.org/trac/ghc/ticket/9476#comment:55, there is a
 regression in total allocations by 18.6%, while counted instructions
 improve by 11.7% (with similar results when measuring actual runtime). As
 the thread reveals, I had to do quite some digging to find out why
 (https://ghc.haskell.org/trac/ghc/ticket/9476#comment:71): the program
 produces more garbage, leading to more favorable points at which GC is
 done. We don't want that to dilute our comparison!

 This is also the reason why iterating the same program by restarting the
 whole process is impossible: GC parameterisation is deterministic
 (rightly so, IMO), and restarting the process will only measure the same
 dilution over and over. On the other hand, iterating the logic n times
 from within the program leads to different GC states at which the actual
 benchmark logic starts, thus leading to a more uniform distribution of
 the points in the program at which GC happens.

 For every benchmark in the patch, I made sure that `replicateM_ 100` is
 actually a sensible thing to do. Compare that to the current version of
 [https://github.com/ghc/nofib/blob/f87d446b4e361cc82f219cf78917db9681af69b3/s...
 awards], where that is not the case: GHC will float out the actual
 computation, and all that is measured is `IO`. In paraffins I used
 `replicateM_ 100` (or `forM_ [1..100] $ const`, rather, TODO), because
 the inner action has a call to `getArgs`, the result of which is used to
 build up the input data. GHC can't know that the result of `getArgs`
 doesn't change, so it can't memoise the benchmark result, and this
 measures what we want. In other cases I actually had to use `forM_` and
 make use of the counter somewhere in generating the input data.

 > - You say "GC wibbles", but I'm not sure these are actually GC wibbles.
 > I just checked paraffins: it doesn't do any IO (other than printing the
 > results), and it's not even threaded (does not use the threaded
 > runtime, does not do `forkIO`). So I think it should be quite
 > deterministic, and I think any wibbles are due to the OS side of
 > things. In other words, if we could have an OS that only runs
 > `paraffins` and nothing else, I think the results would be quite
 > deterministic.
 >
 > Of course this doesn't change the fact that we're getting
 > non-deterministic results and we should do something about it; I'm just
 > trying to understand the root cause here.

 The numbers are deterministic, but they are off in the sense above. By GC
 wibble, I mean that varying tiny parameters of the program or the GC has
 a huge, non-monotonic, 'discontinuous' impact on the perceived
 performance, which makes the benchmark suite unsuited to evaluating any
 optimisation that changes allocations.

 Replying to [comment:3 osa1]:
 > I'm also wondering whether having a fixed iteration count is a good
 > idea. What if, in 5 years, 100 iterations for paraffins is not enough
 > to get reliable numbers? I think `criterion` also has a solution for
 > this (it adjusts the number of repetitions depending on the results).

 The point of iterating 100 times is not to adjust the runtime so that we
 measure something other than system overhead (type 1 in
 https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38). It's solely to
 get smoother results in the above sense. There's still the regular
 fast/norm/slow mechanism, which is there to tune for specific runtimes,
 which are quite likely to vary in the future. Of course, we could just as
 well increase 100 to 243 for even smoother results instead of modifying
 the actual input problem. But having it fixed means that the curve for
 fast is as smooth as for slow, which is a good thing, I think. It means
 we can measure counted instructions on fast without having to worry too
 much about said dilutions.

 Replying to [comment:4 osa1]:
 > Here's a library that could be useful for this purpose:
 > http://hackage.haskell.org/package/bench

 Ah, yes. That could indeed be worthwhile as a replacement for the current
 runner, but only if it supported measuring more than just time.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999#comment:5
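To make the float-out pitfall above concrete, here is a contrast sketch (the `benchmark` function is a hypothetical stand-in, not the actual awards or paraffins code):

{{{#!hs
import Control.Monad (replicateM_)
import System.Environment (getArgs)

-- Hypothetical stand-in for the real benchmark logic.
benchmark :: [Int] -> Int
benchmark = sum . map (* 2)

-- Risky: 'benchmark [1 .. 1000000]' is a closed expression, so GHC may
-- float it out of the loop; we would then measure 100 prints of a
-- single cached result.
mainFloated :: IO ()
mainFloated = replicateM_ 100 (print (benchmark [1 .. 1000000]))

-- Safer: the input depends on getArgs, whose result GHC can't assume
-- constant, so the work is genuinely repeated on every iteration.
main :: IO ()
main = replicateM_ 100 $ do
  [n] <- getArgs
  print (benchmark [1 .. read n])
}}}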

#15999: Stabilise nofib runtime measurements

Comment (by simonmar):

 So as I understand it, the GC "wibbles" you're talking about are caused
 by the number of GCs we run? Making a small change to the nursery size
 can make the difference between N and N+1 GC runs, which could be a large
 difference in runtime.

 You're only looking at `-G1`, right? Generational GC often has weird
 effects based on the timing of when a GC runs. I think there will still
 be issues when there's an old-gen collection right around the end of the
 program run - making a small change may mean the difference between
 running or not running the expensive GC.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999#comment:6
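A back-of-envelope model of that N vs. N+1 effect (a simplification that ignores pinned data, large objects, and so on):

{{{#!hs
-- The minor GC count is roughly total allocation over nursery size,
-- rounded up, so a small nursery (or allocation) change can flip it:
minorGCs :: Double -> Double -> Int
minorGCs allocatedMB nurseryMB = ceiling (allocatedMB / nurseryMB)

-- minorGCs 100 1.00 == 100
-- minorGCs 100 1.05 ==  96
-- minorGCs 101 1.00 == 101   -- 1% more allocation, one more collection
}}}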

#15999: Stabilise nofib runtime measurements

Comment (by sgraf):

 Replying to [comment:6 simonmar]:
 > So as I understand it, the GC "wibbles" you're talking about are caused
 > by the number of GCs we run? Making a small change to the nursery size
 > can make the difference between N and N+1 GC runs, which could be a
 > large difference in runtime.

 Yes, that's one source of wibble (in hindsight, that may have been a bad
 term to use here). But it's not exactly the reason why I'm doing this:
 have a look at the numbers in
 https://ghc.haskell.org/trac/ghc/ticket/9476#comment:55. The `./default`
 run had significantly fewer Gen 0 collections and the same number of Gen
 1 collections as `./allow-cg` (which produces more garbage but is faster
 in total). Gen 1 collections were cheaper for `./allow-cg` for some
 reason. Also note how this correlates with the productivity rate: 10% vs.
 15% for the latter. The findings in that thread led me to plot the above
 curves.

 > You're only looking at `-G1`, right? Generational GC often has weird
 > effects based on the timing of when a GC runs. I think there will still
 > be issues when there's an old-gen collection right around the end of
 > the program run - making a small change may mean the difference between
 > running or not running the expensive GC.

 This is not `-G1`, and I agree that a single old-gen collection might
 make the difference. But when we modify the program so that there are
 ''more'' Gen 1 collections, at more uniformly distributed points in the
 program, I argue we will have a much better experience comparing nofib
 numbers. There are multiple ways to achieve this, but I think the
 simplest one is what I outline above, and it more closely corresponds to
 the workload of real applications (e.g. long running time, growing and
 shrinking working sets).

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999#comment:7

#15999: Stabilise nofib runtime measurements

Description changed by sgraf:

Old description:

> With Phab:D4989 (cf. #15357) having hit `nofib` master, there are still
> many benchmarks that are unstable. I identified three causes of
> instability in https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38.
> With system overhead mostly out of the equation, there are still two
> related tasks left:
>
> 1. Identify benchmarks with GC wibbles. Plan: look at counted
>    instructions while varying heap size with just one generation. A
>    wibbling benchmark should show quite diverse sampled maximum
>    residency (as opposed to a microbenchmark, which should have a quite
>    stable instruction count). Then fix these by iterating `main` 'often
>    enough'. Maybe look at total bytes allocated for that; we want it to
>    decline monotonically as the initial heap size grows.
> 2. Now all benchmarks should have a stable instruction count. If not,
>    maybe there's another class of benchmarks I didn't identify yet in
>    #5793. Of these benchmarks, there are a few, like `real/eff/CS`, that
>    still have highly unstable runtimes. Fix these 'microbenchmarks' by
>    hiding them behind a flag.

New description:

 With Phab:D4989 (cf. #15357) having hit `nofib` master, there are still
 many benchmarks that are unstable in one way or another. I identified
 three causes of instability in
 https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38. With system
 overhead mostly out of the equation, there are still two related tasks
 left:

 1. Identify benchmarks with GC wibbles. Plan: look at how the
    productivity rate changes while increasing the gen 0 heap size. A
    GC-sensitive benchmark should have a non-monotonic or discontinuous
    productivity-rate-over-nursery-size curve. Then fix these by iterating
    `main` often enough for the curve to become smooth and monotone.
 2. Now all benchmarks should have a monotonically decreasing instruction
    count for increasing nursery sizes. If not, maybe there's another
    class of benchmarks I didn't identify yet in #5793. Of these
    benchmarks, there are a few, like `real/eff/CS`, that still have
    highly code-layout-sensitive runtimes. Fix these 'microbenchmarks' by
    hiding them behind a flag.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999#comment:8

#15999: Stabilise nofib runtime measurements

Comment (by Sebastian Graf):

#15999: Stabilise nofib runtime measurements

Changes (by sgraf):

 * status: new => closed
 * resolution:  => fixed

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15999#comment:10