Fast JSON validation - reducing allocations

Dear Café,

We have a stream of ~900-byte JSON documents (lazy ByteStrings) that we would like to validate (NB not parse) as quickly as possible. We just need to check that the bytes are a syntactically valid JSON document and then pass them elsewhere; we do not care what the content is. The documents are encoded in UTF-8 and have no extraneous whitespace.

We have made a number of attempts:

1. Use Aeson to parse the documents and then discard the result (a minimal sketch of this appears after this message). This was actually pretty good: Criterion clocked it at ~30us per document.

2. A hand-written Attoparsec parser. This did quite a bit worse: ~70us per document.

3. Use Get (from the binary package), which was a bit messier as it was factored to avoid all backtracking, but managed to get down to around ~20us per document.

4. Build a table-based, Word8-labelled DFA (with a minor hack to deal with values-within-values) and run it with ByteString.Lazy.foldl'. Also around ~20us per document.

We didn't pursue the Attoparsec idea any further, but the other three all allocate ~85kB per document, and this now seems to be the limiting factor for performance. In more detail, looking at the ThreadScope output, we see it parse 5 or 6 documents in ~150us (allocating ~500kB) and then run the GC, which takes about another 150us. The GC stats indicate that it's only copying a couple of kB of live data each time, so most of that 500kB is junk.

Now 85kB per ~900-byte document is ~96 bytes allocated per input byte, i.e. about a dozen machine words per byte. It's not entirely surprising that there's this much allocation going on each time round the loop, but we'd like to try and get rid of it. I've tried the usual strictness tricks and have peered in a mystified fashion at the output of -ddump-simpl, but am none the wiser.

There's a copy of the relevant code for option 4 at https://github.com/DaveCTurner/json-validator. I've hacked around with it a bit since producing the numbers above, so it's now apparently a bit slower than Aeson but allocates less too (~65kB).

Could anyone help, e.g. by pointing me at the bit in the Core that is allocating within the main loop?

Many thanks,

David
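[Editor's note: for reference, approach 1 amounts to something like the following minimal sketch; the wrapper name is ours, not part of the original post.]

    import           Data.Maybe           (isJust)
    import qualified Data.Aeson           as Aeson
    import qualified Data.ByteString.Lazy as BL

    -- Approach 1: parse with aeson and throw the result away.
    -- 'decode' returns Nothing on a syntax error, so validity is
    -- just 'isJust'.
    isValidJson :: BL.ByteString -> Bool
    isValidJson bs = isJust (Aeson.decode bs :: Maybe Aeson.Value)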

On 2017-05-11 18:12, David Turner wrote:
> Dear Café,
> We have a stream of ~900-byte JSON documents (lazy ByteStrings) that we would like to validate (NB not parse) as quickly as possible. We just need to check that the bytes are a syntactically valid JSON document and then pass them elsewhere; we do not care what the content is. The documents are encoded in UTF-8 and have no extraneous whitespace.
No particular recommendations, but you might want to look into semi-indexing[1] as a strategy. It looks plausible that it would be possible to do that without a lot of allocation; see the paper for details. (I think there's also a demo implementation in Python on GitHub.)

[1] http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf

Interesting, thanks for the link. However we're trying to do _even less_
than that - this is just the "scan the document to produce the Elias-Fano
encoding" step, except without needing to keep a hold of it for later. It's
not quite as trivial as that paper makes out as (a) it doesn't mention the
possibility that the documents might not be well-formed, and (b) it doesn't
really talk about dealing with the special delimiting characters within
string literals, for which you need to drag along a bunch of state while
you're scanning.
This'd be pretty trivial to do in a C-like language now that we've got the DFA
tables built, and I may resort to that at some point if we can't work out
how to avoid the allocations in Haskell-land, but I'd quite like to be able
to solve problems of this form without unnecessarily resorting to C.
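[Editor's note: for concreteness, a minimal sketch of a table-driven scan of roughly this shape — hypothetical types and helper names, not the json-validator code, and it ignores the values-within-values hack mentioned in the original post.]

    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Vector.Unboxed  as VU
    import           Data.Word            (Word8)

    -- Hypothetical representation: a row-major transition table of
    -- numStates * 256 entries, with -1 as the reject state, plus a flag
    -- per state saying whether it is accepting.
    data Dfa = Dfa
      { dfaTable     :: !(VU.Vector Int)
      , dfaAccepting :: !(VU.Vector Bool)
      }

    -- Run the DFA over the document with a strict left fold; the point
    -- is that the accumulator is just an Int, so ideally nothing should
    -- be allocated per input byte.
    runDfa :: Dfa -> BL.ByteString -> Bool
    runDfa (Dfa table accepting) = finish . BL.foldl' step 0
      where
        step :: Int -> Word8 -> Int
        step (-1) _ = -1                                      -- stay rejected
        step s    b = table VU.! (s * 256 + fromIntegral b)
        finish s = s >= 0 && accepting VU.! s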
Cheers,

On 2017-05-11 18:59, David Turner wrote:
> Interesting, thanks for the link. However we're trying to do _even less_ than that - this is just the "scan the document to produce the Elias-Fano encoding" step, except without needing to keep a hold of it for later. It's not quite as trivial as that paper makes out as (a) it doesn't mention the possibility that the documents might not be well-formed, and (b) it doesn't really talk about dealing with the special delimiting characters within string literals, for which you need to drag along a bunch of state while you're scanning.
My thinking was actually that it could be used to reduce the allocations, since everything (AFAIR, it's been a while) would basically just be linear data structures and indexes.
> This'd be pretty trivial to do in a C-like language now that we've got the DFA tables built, and I may resort to that at some point if we can't work out how to avoid the allocations in Haskell-land, but I'd quite like to be able to solve problems of this form without unnecessarily resorting to C.
Well, there's always https://hackage.haskell.org/package/inline-c :)

Regards,
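[Editor's note: for reference, the flavour of inline-c usage, following the package's documented example; this snippet is an assumption added here, not part of the original message.]

    {-# LANGUAGE QuasiQuotes     #-}
    {-# LANGUAGE TemplateHaskell #-}

    import qualified Language.C.Inline as C
    import           Foreign.C.Types   (CDouble)

    -- Embed a C expression directly in Haskell and get back an IO action.
    C.include "<math.h>"

    main :: IO ()
    main = do
      x <- [C.exp| double { cos(1.0) } |]
      print (x :: CDouble)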

On Thu, May 11, 2017, at 09:12, David Turner wrote:
> Could anyone help, e.g. by pointing me at the bit in the Core that is allocating within the main loop?
GHC has a -ticky flag that tracks precisely where allocations are happening. It's quite low level (you may have to stare at the STG in addition to the Core to interpret the results) but I've found it incredibly useful when trying to understand performance swings in GHC's own benchmarks.

https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky

You can enable it per module, which is a nice break from the rebuild-the-world -prof, but the flip side IIRC is that if the allocations are happening in a module that wasn't compiled with -ticky (e.g. somewhere inside Data.Array) you might not see them.

Hope this helps!

Eric

Eric Seidel writes:
> On Thu, May 11, 2017, at 09:12, David Turner wrote:
>> Could anyone help, e.g. by pointing me at the bit in the Core that is allocating within the main loop?
> GHC has a -ticky flag that tracks precisely where allocations are happening. It's quite low level (you may have to stare at the STG in addition to the Core to interpret the results) but I've found it incredibly useful when trying to understand performance swings in GHC's own benchmarks.
>
> https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
>
> You can enable it per module, which is a nice break from the rebuild-the-world -prof, but the flip side IIRC is that if the allocations are happening in a module that wasn't compiled with -ticky (e.g. somewhere inside Data.Array) you might not see them.
Indeed, -ticky is wonderful for this sort of thing since it doesn't affect Core-to-Core optimization, meaning that the instrumented program will behave very similarly to the uninstrumented version (just a bit slower due to counter overhead).

I had a quick look at your example over lunch. I started like this (after commenting out the aeson benchmark in Main),

    $ ghc app/Main.hs -isrc -O2 -ddump-stg -fforce-recomp \
        -ddump-to-file -dsuppress-idinfo -ddump-simpl -rtsopts \
        -ticky -ticky-allocd
    $ app/Main +RTS -rticky.out

This produces a ticky report (named ticky.out) in the current directory. This includes a table which in my case looked something like,

        Entries        Alloc      Alloc'd  Non-void Arguments   STG Name
    --------------------------------------------------------------------------------
            123          192           48  1 L      go9{v slyX} (main@main:Automaton)
              2            0           32  1 L      go8{v slyH} (main@main:Automaton)
        1465016     58600640    146501600  0        $w$j1{v slMc} (main@main:Automaton) in slHB
      161884268            0  11761148448  0        $w$j1{v slJT} (main@main:Automaton) in slHA
              0            0     11720128  4 ppww   $s$wgo2{v slHz} (main@main:Automaton) in r4s
      163532411  11790448768     13185144  5 wwLpp  $wgo7{v slHA} (main@main:Automaton) in r4s
        1831270    167011824     13185144  6 ppwLww $s$wgo1{v slHB} (main@main:Automaton) in r4s
         183127            0     13185144  0        $w$j{v slGN} (main@main:Automaton) in r4s
            992         8624        10944  1 L      go7{v slwV} (main@main:Automaton) in r4m
              3           72            0  1 C      main@main:Automaton.showGraph120{v r3O}
            681            0            0  2 ML     go6{v rkSd} (main@main:Automaton)

The important things we care about here are Alloc (that is, how many bytes each name allocates) and Alloc'd (that is, how many bytes' worth of each closure get allocated).

[Aside: we explicitly enabled the Alloc'd column during compilation with the -ticky-allocd flag; if you omit it this column will contain only zeros. In general I find it helpful to have, so I generally enable it, despite the overhead it introduces.]

The table suggests a few nice starting points: $w$j1_slJT is allocated quite a bit, probably by $wgo7_slHA. Next we can turn to the STG output (src/Automaton.dump-stg) to try to understand why this is.

Reading STG generally tends to be quite hard due to the rather large, deeply nested trees that most programs compile to. That being said, a good editor can help immensely (I also have my ghc-dump [1] package, which I've been meaning to add STG support to; that would enable some interesting analysis techniques).

Unfortunately, my lunch is now all gone so I'll have to drop this here for now. However, I'll try to pick it up again tonight. Do keep us apprised of your progress!

Cheers,

- Ben

[1] https://github.com/bgamari/ghc-dump

Ccing Luke Maurer under the assumption that he will appreciate seeing
the fruits of his labor.
David Turner writes:
> Dear Café,
> (snip)
> There's a copy of the relevant code for option 4 at https://github.com/DaveCTurner/json-validator. I've hacked around with it a bit since producing the numbers above, so it's now apparently a bit slower than Aeson but allocates less too (~65kB).
> Could anyone help, e.g. by pointing me at the bit in the Core that is allocating within the main loop?
While turning this over in my head I realized that this is the sort of program which may be helped significantly by GHC 8.2's improved join point handling. Indeed the timings seem to support this hypothesis:

GHC 8.0.2:

    benchmarking json-validator/Automaton/testEvent
    time                 22.84 μs   (22.76 μs .. 22.94 μs)
                         1.000 R²   (1.000 R² .. 1.000 R²)
    mean                 22.84 μs   (22.76 μs .. 22.94 μs)
    std dev              297.4 ns   (221.8 ns .. 378.8 ns)

    Alloc rate    4,015,256,078,038 bytes per MUT second

GHC 8.2:

    benchmarking json-validator/Automaton/testEvent
    time                 9.221 μs   (9.141 μs .. 9.318 μs)
                         0.998 R²   (0.996 R² .. 1.000 R²)
    mean                 9.163 μs   (9.084 μs .. 9.356 μs)
    std dev              399.8 ns   (193.0 ns .. 745.4 ns)
    variance introduced by outliers: 54% (severely inflated)

    Alloc rate    123,141,635 bytes per MUT second

Wow! I suspect your allocations have now vanished. I didn't verify that the improvement really was due to more join points, but it seems quite likely.

Well done Luke!

Cheers,

- Ben
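[Editor's note: for intuition about the change being discussed, here is a purely illustrative sketch (nothing to do with json-validator's sources) of the kind of local, always-tail-called binding that GHC 8.2's join points can compile to a plain jump rather than a heap-allocated closure. Whether any given compiler version actually allocates for this exact snippet depends on the surrounding optimisations.]

    {-# LANGUAGE BangPatterns #-}

    -- Illustrative only: 'next' is a local continuation that is only ever
    -- tail-called from the branches of the conditional.  With join points
    -- (GHC 8.2+) it can become a labelled jump, with no per-iteration
    -- closure allocation.
    countMatches :: (Int -> Bool) -> [Int] -> Int
    countMatches p = go 0
      where
        go !n []     = n
        go !n (x:xs) =
          let next !n' = go n' xs   -- join point candidate
          in if p x
               then next (n + 1)
               else next n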