WebUI for GHC/Haskell tooling (eventlog)

Hello,

I've written a blog post https://www.patreon.com/posts/41065262 about my WebUI-based eventlog tool. It is also related to eventlog2html and ghc-debug. I'm interested in your opinions and ideas about ghc debugging/profiling tooling. If you have time, please read the post; I'd be glad to hear positive or negative feedback, and it would be great to discuss this topic with you.

Thanks,
Csaba

Csaba
I don’t know anything useful about tooling, WebUI, eventlog2html, etc. But I do know that these things are important, so thank you so much for working on them!
I assume you are in touch with the (new, unified) team working on the Haskell IDE?
Thanks again
Simon

Csaba,

It's really cool to see your work. Well-Typed and Hasura have recently started collaborating on some tooling (have a look at the announcement [1]). We are planning on taking the `ghc-debug` approach that you touched on at the end of your blog post. While our approaches may be slightly different, we should keep in contact; perhaps we'll be able to benefit from each other's work.

-David E

[1] https://hasura.io/blog/partnering-with-well-typed-and-investing-in-the-haske...

Hi,

I read your post back when you posted it and meant to respond, but got distracted! Anyway, I think the profiling tools in ghc could definitely use some attention, so I'm glad you're looking into this!

The below is going to seem like a rant, and maybe it is in some parts, but I mean it to be a constructive attempt to chart the gaps in documentation or tools. It has been observed many times that the haskell performance story is scattered about, and many people have suggested some kind of consolidation, which of course is always The Problem, especially for open source. So here I am observing that again, but there does seem to be promising movement as people get more interested in performance, and your efforts are encouraging.

### documentation

It would be really nice to get more complete and detailed documentation of what the options are, gathered into one place. This is a disorganized list of my own experiences:

The time units in all the profiles seem mysterious. There's a "total time" in the .prof file. There's a time axis on the heap profile. There are times in the GC summary (INIT, MUT, ..., Total). None of these times seem to correspond with each other. What do they mean? Similarly, the "total bytes" in the .prof file doesn't seem to correspond to anything in the GC summary.

Long ago (maybe around 10 years ago) I think I intuited that the heap profile time is CPU time, which is what foiled my attempts to separate program phases with sleeps so I could see them. I resorted to live profiling with ekg, and more recently I have tried to use the eventlog and custom events for that (eventlog2html does draw the event positions, but the feature to show the event text didn't work for me).

Anyway, there are many tools and techniques, but I haven't seen documentation tying them together, along with advice and experience reports and all that good stuff. So I improvise. Here is my latest attempt, for ad-hoc profile exploration: https://github.com/elaforge/karya/blob/work/tools/run_profile.py

It fiddles with all the flags I can never remember, collects and archives the results in a dated directory, runs all the various tools I can never remember (ghc-prof-flamegraph has arguably been the most useful, but see below about SCCs), and tries to extract a summary of the somewhat more stable numbers (GC stats and top profile cost centers) so I can diff them.

Then there is a completely different attempt to get historical performance: run on known inputs with the optimized non-profiling binary, extract the actual runtimes of various phases, and put them in a database to query later: https://github.com/elaforge/karya/tree/work/tools/timing

This is because I don't trust profile-built binaries to be ground truth, even if it's just -prof and the eventlog runtime, no SCCs.

I did some work to convert event logs to the chrome tracing format: https://github.com/elaforge/karya/blob/work/App/ConvertEventLog.hs

In the end, I didn't use the graphical tracing, but just did ad-hoc analysis of the timestamps to see what was most expensive. The event format is another place where documentation would be nice: as you can see from the file, I just copy-pasted the definition out of ghc and guessed what the types mean from their names. This was in the ghc 8.0 era I think, and I recall that the eventlog acquired heap data after that.
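For reference, the custom events mentioned above are just GHC's user markers from Debug.Trace; a minimal sketch (the phase names are invented; compile with -eventlog on GHCs where it is not already the default, and run the program with +RTS -l so the markers land in the .eventlog):

```haskell
-- Minimal sketch: mark program phases in the eventlog with user markers.
module Main (main) where

import Control.Exception (evaluate)
import Debug.Trace (traceMarkerIO)

main :: IO ()
main = do
  traceMarkerIO "phase: load"
  _ <- evaluate (length [1 .. 1000000 :: Int])  -- stand-in for real work
  traceMarkerIO "phase: process"
  _ <- evaluate (sum [1 .. 1000000 :: Int])     -- more stand-in work
  traceMarkerIO "phase: done"
```

(traceEventIO is the related call that emits a plain user event rather than a marker; which of the two a given tool draws seems to vary.)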
I did get the chrome tracing working as a replacement for ThreadScope, and I think in general reusing a framework that other people maintain will work better than a custom GTK app when the maintainer count is in the 0 to 1 range. Though I recall chrome consumes JSON, and trying to get that much data through JSON hit a wall eventually. I guess JSON should be theoretically capable of arbitrary sizes, so presumably chrome is just not optimized for large data... which might undercut the idea that it's better to use someone else's tool.

Despite all of this, over the last 10 or so years, I have never managed to get predictable or consistent numbers. E.g. after a ghc version change the profile numbers get dramatically worse, but wall clock time seems about the same. Or they steadily creep down or up over long periods where no changes should have affected them, or there is no apparent improvement after a change that eliminated a top SCC entry, etc. etc. And this is without the confounding factor of changing hardware, since I do have hardware that's unchanged from 10 years back (I'm lazy about upgrading, ok?)... though of course hardware is confounding in general, and I haven't seen any techniques for how to control for that. Even on the same hardware, CPUs and OSes are quite non-deterministic. The best approach I've seen there is criterion-style analysis for short benchmarks; for long ones I just run them multiple times and hope they are below the noise floor.

Anyway, I know all this stuff goes beyond just haskell and ghc, and is part of the general theme that profiling and benchmarking are hard and no one really seems to know how to do them satisfactorily. For example, here is a fun blog post on how even the mainstream VM world has apparently failed to get useful benchmarks on JITs: https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_o... It reminds me of how it seems only recently did people realize, in the context of mtl vs. various free monads, that the key to mtl performance is monomorphic inlining.

But despite all that, surely we can do better than just banging around blindly alone, as I've done. I think other people have had better success than I have, and I would love to learn from their examples. There is also a whole battery of language-agnostic low-level tools from Intel and whatnot that the HPC or video games people use, and while ghc haskell can be a bit far from that, it doesn't mean they're useless... I've seen references to them being used even for python. After just a little bit of time lurking on a rust-oriented chat, it seems like they think about performance (both throughput and latency) in a more rigorous and systematic way, and are more connected to the broader performance-oriented community. Maybe similar to the way haskell has traditionally been more rigorous and systematic about abstractions and correctness, and more connected to the broader math-oriented community.

The whole thing about SCCs could also use some documentation and advice. Due to some of the experiences above, I don't trust the -fprof-auto-* flags, and I have seen some blog posts supporting that. The basic problem as I understand it is that SCCs prevent inlining, and inlining is the way important optimizations happen.
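To spell out the "criterion-style analysis" mentioned a couple of paragraphs up, the usual shape is defaultMain/bench/whnf; a minimal sketch, with a toy function standing in for the real code under test:

```haskell
-- Minimal criterion sketch; slowFib is only a stand-in for real code.
import Criterion.Main (bench, bgroup, defaultMain, whnf)

slowFib :: Int -> Int
slowFib n = if n < 2 then n else slowFib (n - 1) + slowFib (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "slowFib"
      [ bench "15" (whnf slowFib 15)
      , bench "20" (whnf slowFib 20)
      ]
  ]
```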
But those auto-SCC flags are there, very tempting, and there is even a new one, -fprof-auto-exported, that seems to want to solve the inlining problem. But it can't really, because what you really want is SCCs on non-inlined functions, and I gather that's awkward given the order of the ghc pipeline. And cabal compiles all external libraries with "exported-functions" by default (which you have to look up to figure out means -fprof-auto-exported... can we pick consistent names?), with the result that (I think!) the SCCs stymie the inlining and specialization of (>>=), which, as has been documented (by which I mean the usual blog and reddit posts), can completely alter the performance of mtl-style monadic code. So the first step is to set 'profiling-detail: none' and recompile the whole world, which used to be a lot more hassle, but I think cabal V2 has improved matters.

But, all that said, I also understand why the auto-SCC stuff is so tempting, just to give an overview before you try to zoom in manually with SCCs, because there are zillions of functions to annotate. What approach to use when? Has anyone come up with satisfying guidance? Then there are fascinating experiments like https://github.com/Petrosz007/haskell-profile-highlight , which is something I dreamed about from the beginning, except that it relies on pervasive SCCs, so... is it ok to build on that foundation? Oh, and speaking of SCCs, there's a bug (?) where the entries count is sometimes 0. Every once in a while someone posts somewhere asking about that and no one seems to know.

Every time I do biographical profiling I have to remind myself what exactly LAG, DRAG, INHERENT_USE, and VOID are. So I search my gmail box, because the only documentation I have is a very helpful response Simon Marlow sent when I asked those very questions 10 years back, and the original 1996 biographical profiling paper (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.1219) that he mentions. To this day, if I search the mailing list archives for INHERENT_USE that is the only message that comes up! The paper is still relevant, because it seems the implementation hasn't changed much since 1996 either, but dummies like me need lots of examples and case studies for things to stick.

Also, INHERENT_USE seems to date from pre-bytestring days when no one had significant data in ByteArrays, so it was ok to just handwave it away. That isn't the case anymore, and it means that often most data is not tracked. There is a ghc ticket to improve the situation: https://gitlab.haskell.org/ghc/ghc/-/issues/7275 There has been consistent interest over its 7 years; it looks like it just lacks a volunteer!

Then there is folk knowledge about what ARR_WORDS is. I recently stumbled across a very helpful post by Ben Gamari: https://bgamari.github.io/posts/2016-03-30-what-is-this-array.html There are a bunch of other internal closure types though, which as far as I know require knowing ghc internals to understand.
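Going back to "zoom in manually with SCCs": a minimal sketch of what that looks like, assuming dependencies are built with profiling-detail: none, the program itself is built with -prof, and the run uses +RTS -p (the module and cost-centre names are invented):

```haskell
-- Manual SCCs only on suspected hot spots, so library code keeps its
-- inlining and specialization; the names below are made up.
module Report (render) where

render :: [Int] -> String
render xs =
  let normalized = {-# SCC "render.normalize" #-} map normalize xs
  in  {-# SCC "render.format" #-} unlines (map show normalized)
  where
    normalize n = n `mod` 97  -- stand-in for the real per-element work
```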
And then there is a whole zoo of ad-hoc techniques scattered across blog posts over the last 10 years or so: Neil Mitchell's stack-limiting leak finder, Simon Marlow's weak-pointer leak finder, and an absolutely heroic post about using gdb to directly inspect ghc data structures and find leaks: https://lukelau.me/haskell/posts/leak/

Here's a recent one about memory fragmentation, which might also be the answer to my bytes-discrepancy questions above: https://www.well-typed.com/blog/2020/08/memory-fragmentation/

And of course there are the various ghc pragmas, which are actually pretty well documented, but the advice on how to use them is still scattered around in blog posts: INLINE vs. INLINABLE vs. SPECIALIZE, rewrite rules, etc.

And then there's folk knowledge about libraries and data structures, e.g. Writer being inefficient so use strict StateT instead. But someone also put up writer-cps-mtl, and hey, it says it was merged into 'transformers', so maybe that's all obsolete now? And Either/ExceptT is also inefficient, but in theory a CPS transform fixes that too... but still no except-cps-mtl? I wrote my own by hand, which seemed to be what everyone was doing at the time, but as usual I couldn't demonstrate an actual performance improvement from it. By the way, I assume that is the answer to the attoparsec question on https://www.reddit.com/r/haskell/comments/ir3hmr/compiling_systems_haskell_r...

And there's the existence of short-text and short-bytestring, and of course the most famous folk knowledge, which is difference lists, except that sometimes they hurt more than help, and no one seems to mention that. Or the AppendList (called OrdList in ghc source), which never seemed to gain significant popularity... including with me, since I couldn't get it to demonstrate a performance improvement over [] and (++)... but ghc does use it, so maybe you just have to use it right?

There's also some folk wisdom about LLVM-for-your-loops and vectorization... e.g. I noticed that a foreign call to a C function that does a nested loop to sum buffers of floats is an order of magnitude faster than a loop in ST with unsafeWrite, which is another order of magnitude faster than the high-level Unboxed.Vector.zipWith stuff, and I assume auto-vectorization might be the reason. Of course it seems no one knows how to get it without one of flaky automatic optimization, grungy explicit intrinsics calls, or an entirely new DSL or language, though I suppose haskell does have entries such as 'accelerate' and 'repa'. But anyway, that's getting into low-level performance and numerics, which is a whole specialized field on its own, and it seems hard to port its solutions into general-purpose code.
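To make the Writer-versus-strict-StateT folk advice above concrete, here is a minimal sketch (a toy accumulation loop, assuming mtl; the comments state the folk claim, so measure before trusting it):

```haskell
-- Sketch of "use strict StateT instead of Writer" for accumulation.
import Control.Monad.State.Strict (State, execState, modify')
import Control.Monad.Writer (Writer, execWriter, tell)
import Data.Monoid (Sum (..))

-- Lazy Writer: (>>) mappends the log lazily, so this tends to build a
-- long chain of thunks before anything forces it.
countW :: Int -> Writer (Sum Int) ()
countW 0 = pure ()
countW n = tell (Sum n) >> countW (n - 1)

-- Strict State with modify': the accumulator is forced at each step,
-- so this should run in constant space.
countS :: Int -> State (Sum Int) ()
countS 0 = pure ()
countS n = modify' (<> Sum n) >> countS (n - 1)

main :: IO ()
main = do
  print (getSum (execState (countS 1000000) mempty))
  print (getSum (execWriter (countW 1000000)))
```

(Newer transformers also ships Control.Monad.Writer.CPS, which appears to be the merged writer-cps-mtl work mentioned above.)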
participants (4)
- Csaba Hruska
- David Eichmann
- Evan Laforge
- Simon Peyton Jones