
Eric Seidel
On Thu, May 11, 2017, at 09:12, David Turner wrote:
Could anyone help, e.g. by pointing me at the bit in the Core that is allocating within the main loop?
GHC has a -ticky flag that tracks precisely where allocations are happening. It's quite low level (you may have to stare at the STG in addition to the Core to interpret the results) but I've found it incredibly useful when trying to understand performance swings in GHC's own benchmarks.
https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
You can enable it per module, which is nice break from the rebuild-the-world -prof, but the flip side IIRC is that if the allocations are happening in a module that wasn't compiled with -ticky (e.g. somewhere inside Data.Array) you might not see them.
Indeed, -ticky is wonderful for this sort of thing since it doesn't affect Core-to-Core optimization, meaning that the instrumented program will behave very similarly to the uninstrumented version (just a bit slower due to counter overhead). I had a quick look at your example over lunch. I started like this (after commenting out the aeson benchmark in Main), $ ghc app/Main.hs -isrc -O2 -ddump-stg -fforce-recomp \ -ddump-to-file -dsuppress-idinfo -ddump-simpl -rtsopts \ -ticky -ticky-allocd $ app/Main +RTS -rticky.out This produces a Ticky report (named ticky.out) in the current directory. This includes a table which in my case looked something like, Entries Alloc Alloc'd Non-void Arguments STG Name -------------------------------------------------------------------------------- 123 192 48 1 L go9{v slyX} (main@main:Automaton) 2 0 32 1 L go8{v slyH} (main@main:Automaton) 1465016 58600640 146501600 0 $w$j1{v slMc} (main@main:Automaton) in slHB 161884268 011761148448 0 $w$j1{v slJT} (main@main:Automaton) in slHA 0 0 11720128 4 ppww $s$wgo2{v slHz} (main@main:Automaton) in r4s 16353241111790448768 13185144 5 wwLpp $wgo7{v slHA} (main@main:Automaton) in r4s 1831270 167011824 13185144 6 ppwLww $s$wgo1{v slHB} (main@main:Automaton) in r4s 183127 0 13185144 0 $w$j{v slGN} (main@main:Automaton) in r4s 992 8624 10944 1 L go7{v slwV} (main@main:Automaton) in r4m 3 72 0 1 C main@main:Automaton.showGraph120{v r3O} 681 0 0 2 ML go6{v rkSd} (main@main:Automaton) The important things we care about here are Alloc (that is how many bytes each name allocates) and Alloc'd (that is, how many of each closure are allocated in bytes). [Aside: we explicitly enabled the Alloc'd column during compilation with the -ticky-allocd flag; if you omit it this column will contain only zeros. In general I find it helpful to have so I generally enable it, despite the overhead it introduces.] The table suggests a few nice starting points: $w$j1_slJT is allocated quite a bit, probably by $wgo7_slHA. Next we can turn to the STG output (src/Automaton.dump-stg) to try to understand why this is. Reading STG generally tends to be quite hard due to the rather large, deeply nested trees that most programs compile to. That being said, a good editor can help immensely (I also have my ghc-dump [1] package which I've been meaning to add STG support to, which would enable some interesting analysis techniques). Unfortunately, my lunch is now all gone so I'll have to drop this here for now. However, I'll try to pick it up again tonight. Do keep us apprised of your progress! Cheers, - Ben [1] https://github.com/bgamari/ghc-dump