
#15999: Stabilise nofib runtime measurements
-------------------------------------+-------------------------------------
        Reporter:  sgraf             |                Owner:  (none)
            Type:  task              |               Status:  new
        Priority:  normal            |            Milestone:  ⊥
       Component:  NoFib benchmark   |              Version:  8.6.2
                   suite             |
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #5793 #9476       |  Differential Rev(s):  Phab:D5438
                   #15333 #15357     |
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Changes (by osa1):

 * cc: osa1 (added)
 * differential:  => Phab:D5438

Comment:

 Thanks for doing this! I think running the benchmarks multiple times is a
 good idea. That's what `criterion` does, and it provides quite reliable
 results, even for very fast programs.

 That said, looking at the patch and your `paraffins` example, I have some
 questions:

 - I wonder if it'd be better to run the process multiple times, instead
   of running the `main` function multiple times in the program. Why? That
   way we know GHC won't fuse or otherwise optimize the `replicateM_ 100`
   call in the program, and we properly reset all the resources and global
   state (both the program's and the runtime system's, e.g. weak pointers,
   threads, stable names). It just seems more reliable.

 - Of course this would make the analysis harder, as each run would print
   its own GC stats, which we would need to parse and somehow combine ...

 - That said, I wonder whether GC numbers are important for the purposes
   of nofib. In nofib we care about allocations and runtimes; as long as
   these numbers are stable it should be fine. So perhaps it's not too
   hard to repeat the process run instead of the `main` function.

 - You say "GC wibbles", but I'm not sure these are actually GC wibbles. I
   just checked `paraffins`: it doesn't do any IO (other than printing the
   results), and it's not even threaded (it does not use the threaded
   runtime and does not do `forkIO`).
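To illustrate the first point, repeating the whole process (rather than the `main` call) could be driven by a small Haskell harness. This is only a rough sketch of the idea, not part of nofib or the patch; the `benchmark` driver and its statistics are hypothetical, and a real harness would additionally have to parse and combine the per-run GC stats:

```haskell
import Control.Monad (replicateM)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Process (callProcess)

-- Time a single run of the benchmark as a fresh process, so GHC cannot
-- optimise across iterations and all RTS state (heap, weak pointers,
-- stable names, threads) is reset for every run.
timeRun :: FilePath -> [String] -> IO Double
timeRun exe args = do
  t0 <- getCurrentTime
  callProcess exe args
  t1 <- getCurrentTime
  pure (realToFrac (diffUTCTime t1 t0))

-- Hypothetical driver: run the process n times and report the mean
-- wall-clock time and its standard deviation.
benchmark :: Int -> FilePath -> [String] -> IO (Double, Double)
benchmark n exe args = do
  ts <- replicateM n (timeRun exe args)
  let mean = sum ts / fromIntegral n
      var  = sum [(t - mean) ^ (2 :: Int) | t <- ts] / fromIntegral n
  pure (mean, sqrt var)
```

The spread reported here is what would tell us whether the process-level repetition actually stabilises the measurements.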
 So I think it should be quite deterministic, and any wibbles are due to
 the OS side of things. In other words, if we had an OS that ran only
 `paraffins` and nothing else, I think the results would be quite
 deterministic. Of course this doesn't change the fact that we're getting
 non-deterministic results and should do something about it; I'm just
 trying to understand the root cause here.

 On my first point: if there is a solution for benchmarking "processes"
 (instead of "functions") using criterion-style iteration (by which I mean
 one that provides stable results), I think it may be worth trying. A few
 years back we used `hsbencher` for this purpose at IU, but IIRC it's a
 bit too heavy (lots of dependencies), and it seems unmaintained now. I
 vaguely recall another program for this purpose but I can't remember the
 name...

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15999#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler