
David Turner
On 12 May 2017 at 09:27, Arjen wrote:

Maybe this is a silly question, and please let me know why if so, but:
Has anyone thought about parallelizing it for multiple messages in order to "produce garbage faster"? While reducing allocation will make the single validations faster, doing multiple ones might improve the throughput per GC ratio. This assumes that the amount of live data in the heap is small, making GC sort of constant time, and having multiple cores available.
Not a silly question at all. Adding the following incantation:
`using` parListChunk 100 rseq
does quite happily spread things across all 4 cores on my development machine, and it's certainly a bit faster. To give some stats, it processes ~24 events between GCs rather than ~6, and collects ~2MB rather than ~500kB. The throughput becomes a lot less consistent, however, at least partly due to some bigger GC pauses along the way. As Ben's statistics showed, our allocation rate on one thread is around 4GB/s, which is already stressing the GC out a bit, and parallelising it doesn't make that problem any easier.
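A minimal, self-contained sketch of that incantation (the `Message` type and `validate` function here are stand-ins for the real ones in the application; compile with `ghc -O2 -threaded` and run with `+RTS -N4`):

```haskell
import Control.Parallel.Strategies (parListChunk, rseq, using)

-- Stand-in for the real message type and validation logic.
type Message = String

validate :: Message -> Bool
validate = not . null  -- placeholder: real validation goes here

-- Validate messages in parallel, in chunks of 100, evaluating each
-- result to weak head normal form (rseq).
validateAll :: [Message] -> [Bool]
validateAll msgs = map validate msgs `using` parListChunk 100 rseq

main :: IO ()
main = print (length (filter id (validateAll (replicate 1000 "event"))))
```

The chunk size of 100 is just the figure from the incantation above; tuning it trades scheduling overhead against load balance.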
To elaborate on this a bit: in my experience, allocation rate becomes even more important for performance (both latency and throughput) when introducing parallelism. The reason is that our GC is a stop-the-world collector, so whenever one thread needs to GC, it must stop and wait for all the others to stop as well. This means that your program will be synchronizing constantly, even if your threads are working on completely non-intersecting sets of heap objects, which greatly limits your usable parallelism. You can observe this behavior using the eventlog.

It is possible to tune the GC to trade latency for throughput without changing your program: increasing the size of the nursery (e.g. with +RTS -A64m) means that each thread takes longer to fill its nursery and consequently needs to GC less frequently, but each minor GC will take longer. For this reason, if your program is truly embarrassingly parallel, it is usually preferable performance-wise to run multiple processes concurrently than to use pure parallelism (a sad truth in a harsh reality, in my opinion).

Note that there have been efforts to help with this issue in the past: Simon Marlow has a paper [1] examining a GC scheme with thread-local heaps, allowing threads to perform minor collections while other threads continue to run. Unfortunately, this introduced a significant amount of complexity into the RTS which the performance numbers just didn't justify.

The new compact regions feature also bears on this issue, as it gives users tools for better managing the effect of their long-lived live data on GC performance. That being said, this wouldn't help in your case, as you have little long-lived live data. Some form of the linear types proposal may also some day offer help here, as it could in principle allow alternatives to today's GC'd heap.
For instance, in your case you may want to use an arena allocation strategy: when starting a document, allocate an arena which will serve as your nursery, run your computation, copy the result out to the GC'd heap at the end, and throw away the arena.

Finally, I hope that someone will some day pick up the RTS work again. This might mean trying another approach to thread-local heaps, allowing multiple independent heaps per RTS, or even allowing multiple RTSs per process. It is sad that our parallelism story is limited in the ways that it is, and there are still a number of stones that remain unturned.

Cheers,

- Ben

[1] http://community.haskell.org/~simonmar/local-gc.pdf
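As a footnote to the compact regions feature Ben mentions: a minimal sketch of how it is used, assuming the ghc-compact package (GHC 8.2 or later); the lookup table here is a made-up example of long-lived live data. Once copied into a compact region, the data is treated by the GC as a single object and its interior is never traced, so it stops contributing to minor GC pause times.

```haskell
import GHC.Compact (compact, getCompact)

main :: IO ()
main = do
  -- Imagine this is a large, long-lived structure that would otherwise
  -- be traced on every major GC.
  let table = [(i, show i) | i <- [1 .. 100000 :: Int]]
  region <- compact table           -- copy the structure into a compact region
  let table' = getCompact region    -- read it back as an ordinary Haskell value
  print (lookup 50000 table')      -- prints Just "50000"
```

Note that only fully-evaluated, pointer-closed data can be compacted (no functions or thunks capturing outside references), which fits lookup tables and parsed documents well.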