Re: [Haskell-cafe] Fast JSON validation - reducing allocations

Ben, Eric, thanks for your help. A whole new world of low-level statistics
opens up before me...
On 11 May 2017 at 18:57, Ben Gamari wrote:
Eric Seidel writes:
On Thu, May 11, 2017, at 09:12, David Turner wrote:
Could anyone help, e.g. by pointing me at the bit in the Core that is allocating within the main loop?
GHC has a -ticky flag that tracks precisely where allocations are happening. It's quite low level (you may have to stare at the STG in addition to the Core to interpret the results) but I've found it incredibly useful when trying to understand performance swings in GHC's own benchmarks.
https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
You can enable it per module, which is a nice break from the rebuild-the-world -prof, but the flip side IIRC is that if the allocations are happening in a module that wasn't compiled with -ticky (e.g. somewhere inside Data.Array) you might not see them.
Indeed, -ticky is wonderful for this sort of thing since it doesn't affect Core-to-Core optimization, meaning that the instrumented program will behave very similarly to the uninstrumented version (just a bit slower due to counter overhead).
I had a quick look at your example over lunch. I started like this (after commenting out the aeson benchmark in Main),
$ ghc app/Main.hs -isrc -O2 -ddump-stg -fforce-recomp \
      -ddump-to-file -dsuppress-idinfo -ddump-simpl -rtsopts \
      -ticky -ticky-allocd
$ app/Main +RTS -rticky.out
Ok, I was just about to ask about the Alloc'd column as mine was all zeroes, but the -ticky-allocd was what I needed, thanks.
The table suggests a few nice starting points: $w$j1_slJT is allocated quite a bit, probably by $wgo7_slHA. Next we can turn to the STG output (src/Automaton.dump-stg) to try to understand why this is. Reading STG generally tends to be quite hard due to the rather large, deeply nested trees that most programs compile to.
You're not wrong. I've trawled through the STG as best I can for a newbie. The function with a large number in the 'Alloc' column looks pretty much to be the heart of the `step` function. There's a lot of nesting, but it looks largely to be just checking the bounds on array indices, which I don't think allocates anything. However, I see this little snippet at the very bottom of the nesting:

    case indexArray# [ds1_sl1b y2_sl5H] of _ [Occ=Dead] {
      (##) ipv8_sl60 [Occ=Once!] ->
        case ipv8_sl60 of _ [Occ=Dead] {
          GHC.Word.W8# dt6_sl62 [Occ=Once] ->
            case ss_sl5v of dt7_sl63 {
              __DEFAULT -> (#,,#) [__word 1 dt6_sl62 dt7_sl63];
            };
        };
    };

I'm guessing that the `indexArray` call is the line that reads `nextState = aTransitionsTable makeAutomaton AU.! (currentState, nextByte)`, and to me it looks like it might be unboxing the thing it's getting out of the array. Not sure that counts as an allocation, but I'm surprised anyway. I'll keep digging, but it'd be good to hear of any progress at your end too. Hope it didn't spoil your lunch!
Cheers,
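(For anyone following along, here is a minimal sketch of the distinction being poked at here. None of this is the repo's actual code, and the index and element types are guesses; it just contrasts a boxed Array with an unboxed UArray of Word8.)

    module ArrayLookupToy where

    import Data.Word (Word8)
    import Data.Array (Array, (!))
    import qualified Data.Array.Unboxed as AU

    -- Boxed table: each element is a pointer to a heap-allocated Word8
    -- box, and indexing shows up in STG as indexArray# followed by
    -- unwrapping GHC.Word.W8#, the same shape as the snippet above.
    boxedLookup :: Array (Int, Word8) Word8 -> (Int, Word8) -> Word8
    boxedLookup table ix = table ! ix

    -- Unboxed table: the Word8s live in a flat byte array, so indexing
    -- reads the byte directly; whether the result is then re-boxed
    -- depends on how the caller consumes it.
    unboxedLookup :: AU.UArray (Int, Word8) Word8 -> (Int, Word8) -> Word8
    unboxedLookup table ix = table AU.! ix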
While turning this over in my head I realized that this is the sort of program which may be helped significantly by GHC 8.2's improved join point handling. Indeed the timings seem to support this hypothesis:
GHC 8.0.2
benchmarking json-validator/Automaton/testEvent
time                 22.84 μs   (22.76 μs .. 22.94 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 22.84 μs   (22.76 μs .. 22.94 μs)
std dev              297.4 ns   (221.8 ns .. 378.8 ns)
Alloc rate 4,015,256,078,038 bytes per MUT second
GHC 8.2
benchmarking json-validator/Automaton/testEvent
time                 9.221 μs   (9.141 μs .. 9.318 μs)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 9.163 μs   (9.084 μs .. 9.356 μs)
std dev              399.8 ns   (193.0 ns .. 745.4 ns)
variance introduced by outliers: 54% (severely inflated)
Alloc rate 123,141,635 bytes per MUT second
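(A rough illustration of the kind of code shape the join-point work helps, assuming the hot loop looks something like this; the code below is a toy, not anything from json-validator. In a toy this small GHC may simply inline 'exit'; the effect matters in bigger loops like the automaton's step function.)

    {-# LANGUAGE BangPatterns #-}
    module JoinPointToy where

    -- 'exit' captures the loop variable 'i' and is only ever tail-called,
    -- so it cannot be floated out of 'go'.  With join points (GHC 8.2) the
    -- two calls can compile to jumps; without them it is an ordinary local
    -- function closure living inside the loop.
    walk :: (Int -> Bool) -> Int -> Int
    walk p n = go 0
      where
        go :: Int -> Int
        go !i
          | i >= n    = exit 0
          | p i       = go (i + 1)
          | otherwise = exit 1
          where
            exit code = i * 2 + code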
Wow! I suspect your allocations have now vanished.
Ooo, that's more like it.
Could you run again using the following to get Criterion's estimate of the
allocations-per-call?
json-validator-exe --regress allocated:iters +RTS -T
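(For reference, the kind of harness that invocation assumes, sketched with made-up names; 'validate' here is just a stand-in for the real automaton. The allocated:iters regression reports bytes allocated per iteration, and it only works when the program is run with +RTS -T so the RTS exposes its GC statistics to criterion.)

    module Main (main) where

    import Criterion.Main (bench, defaultMain, whnf)
    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Char8 as BSC

    -- Stand-in for the real validator; any cheap ByteString function will
    -- do for demonstrating the --regress machinery.
    validate :: BS.ByteString -> Bool
    validate = BS.all (/= 0)

    main :: IO ()
    main = defaultMain
      [ bench "toy/validate" (whnf validate (BSC.pack "{\"a\":1}")) ]

    -- Run as:   toy-bench --regress allocated:iters +RTS -T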
Cheers,
David
On 11 May 2017 at 19:45, David Turner wrote:
On 11 May 2017 at 19:40, Ben Gamari wrote:
Ccing Luke Maurer under the assumption that he will appreciate seeing the fruits of his labor.
David Turner writes:
Dear Café,
(snip)
There's a copy of the relevant code for option 4 at https://github.com/DaveCTurner/json-validator. I've hacked around with it a bit since producing the numbers above, so it's now apparently a bit slower than Aeson but allocates less too (~65kB).
Could anyone help, e.g. by pointing me at the bit in the Core that is allocating within the main loop?
(snip)