Re: [Haskell-cafe] Fast JSON validation - reducing allocations

David Turner
Ben, Eric, thanks for your help. A whole new world of low-level statistics opens up before me...
No worries!
You're not wrong.
I've trawled through the STG as best I can for a newbie. The function with a large number in the 'Alloc' column looks pretty much to be the heart of the `step` function. There's a lot of nesting, but it looks largely to be just checking the bounds on array indices, which I don't think allocates anything.
Something that I should have mentioned earlier is that STG has the nice property that all allocation is syntactically obvious: allocated closures manifest as `let`s. This makes it fairly easy to pick out possible allocation sites, even in large dumps.
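For instance, a lazily boxed result such as

    f :: Int -> Maybe Int
    f x = Just (x + 1)

comes out of -ddump-stg roughly as (schematically; the exact names and form vary)

    f = \r [x] let { sat = \u [] x + 1 } in Just [sat];

and that `let` is exactly the allocation of the thunk for `x + 1`.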
However, I see this little snippet at the very bottom of the nesting:
case indexArray# [ds1_sl1b y2_sl5H] of _ [Occ=Dead] {
  (##) ipv8_sl60 [Occ=Once!] ->
    case ipv8_sl60 of _ [Occ=Dead] {
      GHC.Word.W8# dt6_sl62 [Occ=Once] ->
        case ss_sl5v of dt7_sl63 {
          __DEFAULT -> (#,,#) [__word 1 dt6_sl62 dt7_sl63];
        };
    };
};
I'm guessing that the `indexArray` call is the line that reads `nextState = aTransitionsTable makeAutomaton AU.! (currentState, nextByte)`, and to me it looks like it might be unboxing the thing it's getting out of the array. Not sure that counts as an allocation, but I'm surprised anyway.
The unboxing should just involve looking at the value field of the W8# closure (i.e. a memory dereference) and putting the value in a machine register. There is no allocation here as far as I can see.
I'll keep digging anyway, but it'd be good to hear of any progress at your end too. Hope it didn't spoil your lunch!
Far from it! It was nice to have a puzzle to ponder. Cheers, - Ben

Morning all,
On 11 May 2017 at 20:32, Ben Gamari wrote:
Something that I should have mentioned earlier is that STG has the nice property that all allocation is syntactically obvious: allocated closures manifest as `let`s. This makes it fairly easy to pick out possible allocation sites, even in large dumps.
Ah, that's very useful to know! Armed with that knowledge, I came to the conclusion that the allocation was for the sharing of the `nextState` variable. Inlining it brings it down to 20us and 22kB per iteration.

https://github.com/DaveCTurner/json-validator/commit/ec994ec9226ca7bc2e76f19bef98f42e0b233524

Getting closer, but it looks like waiting for 8.2 is a better answer. Looking forward to it!

Cheers,
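To illustrate the shape of that change (a schematic sketch with names assumed from the thread, not the actual json-validator code):

    import qualified Data.Array.Unboxed as AU
    import Data.Word (Word8)

    type Table = AU.UArray (Int, Word8) Int

    -- Sharing the lookup through a binding gave GHC something to box on
    -- every step; using the lookup expression directly at its use site
    -- lets the result stay unboxed. (In this toy version GHC would likely
    -- optimise both identically; in the real `step` the binding was shared
    -- across a larger expression.)
    stepShared, stepInlined :: Table -> Int -> Word8 -> Int
    stepShared table s b = let nextState = table AU.! (s, b) in nextState
    stepInlined table s b = table AU.! (s, b)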

On Fri, 2017-05-12 at 09:10 +0100, David Turner wrote:
Morning all, [...] Armed with that knowledge, I came to the conclusion that the allocation was for the sharing of the `nextState` variable. Inlining it brings it down to 20us and 22kB per iteration. [...] Getting closer, but it looks like waiting for 8.2 is a better answer.
Maybe this is a silly question, and please let me know why if so, but: has anyone thought about parallelizing it for multiple messages in order to "produce garbage faster"? While reducing allocation will make the single validations faster, doing multiple ones might improve the throughput per GC ratio. This assumes that the amount of live data in the heap is small, making GC sort of constant time, and that multiple cores are available.

I wonder whether a few strategically placed par's and pseq's might allow you to scale horizontally without requiring invasive changes to the program's structure. Apologies for not trying to do this myself first ;-).

kind regards, Arjen
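A minimal sketch of that suggestion (`isValid` here stands in for whatever validator is in play; `par` and `pseq` come from the parallel package):

    import Control.Parallel (par, pseq)

    -- Spark the first validation while evaluating the second, then pair
    -- the results; `par` is only a hint, so no structural changes needed.
    validatePair :: (msg -> Bool) -> msg -> msg -> (Bool, Bool)
    validatePair isValid m1 m2 =
      let r1 = isValid m1
          r2 = isValid m2
      in r1 `par` (r2 `pseq` (r1, r2))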

On 12 May 2017 at 09:27, Arjen wrote:
Maybe this is a silly question, and please let me know why if so, but: has anyone thought about parallelizing it for multiple messages in order to "produce garbage faster"? [...]
Not a silly question at all. Adding the following incantation:

    `using` parListChunk 100 rseq

does quite happily spread things across all 4 cores on my development machine, and it's certainly a bit faster. To give some stats, it processes ~24 events between GCs rather than ~6, and collects ~2MB rather than ~500kB. The throughput becomes a lot less consistent, however, at least partly due to some bigger GC pauses along the way. As Ben's statistics showed, our allocation rate on one thread is around 4TBps, which is already stressing the GC out a bit, and parallelising it doesn't make that problem any easier.

I know in the OP I said "we have a stream" (accidental MLK misquote) but in fact there are a bunch of parallel streams whose number exceeds the number of available cores, so we don't anticipate any enormous benefit from spreading the processing of any one stream across multiple cores: single-threaded performance is what we think we should be concentrating on.

Cheers,
David
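Spelled out, the strategy David added looks roughly like this (`isValidJson` is assumed from the json-validator repo's Automaton module; chunk size 100 as in his message):

    import Control.Parallel.Strategies (parListChunk, rseq, using)
    import qualified Data.ByteString.Lazy as BL

    -- Evaluate the result list in chunks of 100, one spark per chunk.
    validateAll :: (BL.ByteString -> Bool) -> [BL.ByteString] -> [Bool]
    validateAll isValidJson msgs =
      map isValidJson msgs `using` parListChunk 100 rseq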

On Fri, 2017-05-12 at 10:48 +0100, David Turner wrote:
Not a silly question at all. Adding the following incantation: `using` parListChunk 100 rseq does quite happily spread things across all 4 cores on my development machine, and it's certainly a bit faster. [...]
Happy to read that you have more success than I did with parListChunk. I expect(ed) the sparks to be cheap, so this might help if the various incoming streams are very unbalanced in length and/or message sizes. I agree that multi-processing a single stream does not make much sense if you already have multiple concurrent streams.

Nice to see that adding parallelism in Haskell (GHC) is that easy (in this case) and with a very good speed-up factor!

kind regards, Arjen

On 12 May 2017 at 11:00, Arjen wrote:
Happy to read that you have more success than I did with parListChunk. [...]
Ah, ok, this is strange. The production system is definitely faster with the parallel strategy (on one stream) but my Criterion benchmarks are much slower. I've updated the code and results on Github (https://github.com/DaveCTurner/json-validator). Probably not going to be able to look at the parallel version much more, but thought you might like a look.

Cheers,

On Fri, 2017-05-12 at 10:48 +0100, David Turner wrote:
Not a silly question at all. Adding the following incantation: `using` parListChunk 100 rseq [...] The throughput becomes a lot less consistent, however, at least partly due to some bigger GC pauses along the way. [...]
Apologies for spamming, but if you want more mutator time between garbage collections, try increasing the nursery size. I get a lot less time in GC and also far fewer syncs on GC (according to -s). It seems to reduce the execution time by a third using something like -A8M or -A64M (quite extreme). In a large program, this effect might be less.

kind regards, Arjen
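Concretely, that looks something like this (a hypothetical invocation; -A sets the allocation-area/nursery size, -s prints the GC summary):

    $ ghc -O2 -threaded -rtsopts Main.hs
    $ ./Main +RTS -N4 -A64M -s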

David Turner
[...] As Ben's statistics showed, our allocation rate on one thread is around 4TBps, which is already stressing the GC out a bit, and parallelising it doesn't make that problem any easier.
To elaborate on this a bit: in my experience allocation rate becomes even more important for performance (both latency and throughput) when introducing parallelism. The reason is that our GC is a stop-the-world collector, and consequently whenever one thread needs to GC, it must stop and wait for all others to stop as well. This means that your program will be synchronizing constantly, even if your threads are working on completely non-intersecting sets of heap objects, greatly limiting your usable parallelism. You can observe this behavior using the eventlog.

It is possible to tune the GC to trade latency for throughput without changing your program. Increasing the size of the nursery will mean that each thread will take longer to fill its nursery and consequently require less frequent GC. However, this means that each minor GC will take longer. For this reason, if your program is truly embarrassingly parallel, it is usually preferable performance-wise to run multiple processes concurrently than to use pure parallelism (a sad truth in a harsh reality, in my opinion).

Note that there have been efforts to help this issue in the past: Simon Marlow has a paper [1] examining a GC scheme with thread-local heaps, allowing threads to perform minor collections while other threads continue to run. Unfortunately this introduced a significant amount of complexity into the RTS which the performance numbers just didn't justify.

The new compact regions feature also bears on this issue, as it gives users tools for better managing the effect of their long-lived live data on GC performance. That being said, this wouldn't help in your case as you have little long-lived live data.

Some form of the linear types proposal may also some day offer help here, as it could in principle allow alternatives to today's GC'd heap. For instance, in your case you may want to use an arena allocation strategy: when starting a document, allocate an arena which will serve as your nursery; run your computation; copy your result out to the GC'd heap at the end; and throw away the arena.

Finally, I hope that someone will some day pick up the RTS work again. This might mean trying another approach to thread-local heaps, allowing multiple independent heaps per RTS, or even allowing multiple RTSs per process. It is sad that our parallelism story is limited in the ways that it is, and there are still a number of stones that remain unturned.

Cheers,

- Ben

[1] http://community.haskell.org/~simonmar/local-gc.pdf
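For reference, producing and inspecting the eventlog Ben mentions looks roughly like this (assuming a program in Main.hs; ThreadScope is a separate viewer for the log):

    $ ghc -O2 -threaded -eventlog -rtsopts Main.hs
    $ ./Main +RTS -N4 -l -RTS
    $ threadscope Main.eventlog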

On 12 May 2017 at 14:44, Ben Gamari wrote:
For this reason, if your program is truly embarrassingly parallel, it is usually preferable performance-wise to run multiple processes concurrently than to use pure parallelism (a sad truth in a harsh reality, in my opinion).
Hmm, an interesting idea. It's not _all_ embarrassingly parallel, but the bottleneck is. You're right, we could have separate front-end processes just doing the validation (indeed they could even be distributed across the network). I don't think we'll do that just yet, particularly not when 8.2's results are so promising, but it sounds a useful idea to keep in the toolbox. Thanks again, David

David Turner
On 12 May 2017 at 14:44, Ben Gamari wrote: [...]
Hmm, an interesting idea. [...] You're right, we could have separate front-end processes just doing the validation (indeed they could even be distributed across the network). [...] it sounds a useful idea to keep in the toolbox.
Indeed. Note that Haskell is also quite good at distributed computation. StaticPointers makes it relatively easy to talk about code distributed across processes and machines. See, for instance, the Cloud Haskell suite of libraries. Cheers, - Ben
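A tiny illustration of the StaticPointers idea (the stub validator here is made up; Cloud Haskell layers serialisable closures on top of this mechanism):

    {-# LANGUAGE StaticPointers #-}
    import GHC.StaticPtr (StaticPtr, deRefStaticPtr)

    -- A static pointer names a closed expression that any process running
    -- the same binary can resolve, so it can be referred to across machines.
    validatorRef :: StaticPtr (String -> Bool)
    validatorRef = static (not . null)   -- stand-in for a real validator

    main :: IO ()
    main = print (deRefStaticPtr validatorRef "{}")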

On Fri, 2017-05-12 at 10:27 +0200, Arjen wrote:
[...] I wonder whether a few strategically placed par's and pseq's might allow you to scale horizontally without requiring invasive changes to the program's structure. [...]
Unfortunately, this does not seem to help. I tried it, and while I did see 3.5 and 3.95 CPUs used with parList and parListChunk 8 respectively, the timings show it to be slower than sequential evaluation of a list of testEvent. Thank you for the problem to try this on :-) Please let me know if I made a mistake in the testing code below.

kind regards, Arjen

---8<---
benchmarking json-validator/Automaton/testEvent
time                 26.57 μs   (26.07 μs .. 27.12 μs)
                     0.987 R²   (0.973 R² .. 0.995 R²)
mean                 31.01 μs   (28.17 μs .. 44.49 μs)
std dev              15.65 μs   (5.258 μs .. 35.77 μs)
variance introduced by outliers: 99% (severely inflated)

benchmarking json-validator/Automaton-List/testEventList
time                 65.11 ms   (61.54 ms .. 69.54 ms)
                     0.982 R²   (0.951 R² .. 0.996 R²)
mean                 59.18 ms   (56.28 ms .. 63.22 ms)
std dev              5.801 ms   (4.344 ms .. 8.238 ms)
variance introduced by outliers: 32% (moderately inflated)

benchmarking json-validator/Automaton-ParList/testEventList
time                 243.9 ms   (214.9 ms .. 270.6 ms)
                     0.996 R²   (0.986 R² .. 1.000 R²)
mean                 253.7 ms   (243.6 ms .. 260.8 ms)
std dev              10.32 ms   (6.781 ms .. 13.19 ms)
variance introduced by outliers: 16% (moderately inflated)

benchmarking json-validator/Automaton-ParListChunk/testEventList
time                 211.4 ms   (193.1 ms .. 232.3 ms)
                     0.997 R²   (0.990 R² .. 1.000 R²)
mean                 200.3 ms   (193.0 ms .. 206.6 ms)
std dev              9.256 ms   (7.106 ms .. 10.21 ms)
variance introduced by outliers: 14% (moderately inflated)

  19,225,632,224 bytes allocated in the heap
     109,968,376 bytes copied during GC
       2,062,736 bytes maximum residency (20 sample(s))
       2,250,352 bytes maximum slop
              10 MB total memory in use (0 MB lost due to fragmentation)

                       Tot time (elapsed)  Avg pause  Max pause
  Gen  0   28722 colls, 28722 par   17.829s   1.591s     0.0001s    0.0160s
  Gen  1      20 colls,    19 par    0.255s   0.054s     0.0027s    0.0058s

  Parallel GC work balance: 7.41% (serial 0%, perfect 100%)

  TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)

  SPARKS: 25625 (25570 converted, 0 overflowed, 0 dud, 20 GC'd, 35 fizzled)

  INIT    time    0.001s  (  0.002s elapsed)
  MUT     time   44.403s  ( 22.013s elapsed)
  GC      time   18.084s  (  1.645s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   62.490s  ( 23.660s elapsed)

  Alloc rate    432,981,654 bytes per MUT second

  Productivity  71.1% of total user, 93.0% of total elapsed

gc_alloc_block_sync: 25370
whitehole_spin: 0
gen[0].sync: 405
gen[1].sync: 22
---8<---

{-# LANGUAGE OverloadedStrings #-}
module Main where

import Criterion.Main
import Data.Aeson
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Encoding as T
import qualified Data.Text.IO as T
import qualified Data.Text as T (append, pack)
import qualified Automaton
import Control.Parallel.Strategies

testEvent :: Int -> BL.ByteString
testEvent i = BL.fromStrict $ T.encodeUtf8 (T.append "{\"stream\":\"actuals-stream\",\"submitter\":{\"type\":\"other\",\"description\":\"redactedredac\"},\"driverActivities\":[{\"driverActivity\":{\"journey\":{\"headcode\":\"1A01\",\"description\":\"redactedredactedredactedredactedredactedredacte\"},\"activity\":[{\"arrivalTime\":null,\"sequence\":1,\"tiploc\":\"REDACTE\",\"stop\":true,\"departureTime\":\"2016-06-09T18:22:28.000000000000Z\"},{\"arrivalTime\":\"2016-06-09T18:24:43.000000000000Z\",\"sequence\":2,\"tiploc\":\"REDACTE\",\"stop\":true,\"departureTime\":\"2016-06-09T18:25:51.000000000000Z\"},{\"arrivalTime\":\"2016-06-09T18:26:58.000000000000Z\",\"sequence\":3,\"tiploc\":\"REDACT\",\"stop\":true,\"departureTime\":\"2016-06-09T18:28:08.000000000000Z\"},{\"arrivalTime\":\"2016-06-09T18:29:57.000000000000Z\",\"sequence\":4,\"tiploc\":\"REDACTE\",\"stop\":true,\"departureTime\":null}]},\"activityUserId\":\"521be60a-02f2-4892-b468-c17d9c1c4fcf\"}],\"submissionTime\":\"2016-06-09T18:36:45.831486000000Z\",\"type\":\"driverActivityLogged" (T.append (T.pack (show i)) "\"}"))

data AnyJSON = AnyJSON
  deriving (Show, Eq)

instance FromJSON AnyJSON where
  parseJSON _ = pure AnyJSON

main :: IO ()
main = print (testEventList [0]) >> defaultMain
  [ bgroup "json-validator/Automaton"
      [ bench "testEvent" $ whnf Automaton.isValidJson (testEvent 0) ]
  , bgroup "json-validator/Automaton-List"
      [ bench "testEventList" $ whnf (\es -> and (testEventList es `using` rseq)) [1..1000] ]
  , bgroup "json-validator/Automaton-ParList"
      [ bench "testEventList" $ whnf (\es -> and (testEventList es `using` parList rseq)) [1..1000] ]
  , bgroup "json-validator/Automaton-ParListChunk"
      [ bench "testEventList" $ whnf (\es -> and (testEventList es `using` parListChunk 8 rseq)) [1..1000] ]
  ]
  where
    testEventList = map (Automaton.isValidJson . testEvent)
participants (4):
- Arjen
- Ben Gamari
- Ben Gamari
- David Turner