
David Turner
On 12 May 2017 at 09:27, Arjen wrote:

Maybe this is a silly question, and please let me know why if so, but:
Has anyone thought about parallelizing it for multiple messages in order to "produce garbage faster"? While reducing allocation will make the single validations faster, doing multiple ones might improve the throughput per GC ratio. This assumes that the amount of live data in the heap is small, making GC sort of constant time, and having multiple cores available.
Not a silly question at all. Adding the following incantation:
`using` parListChunk 100 rseq
does quite happily spread things across all 4 cores on my development machine, and it's certainly a bit faster. To give some stats, it processes ~24 events between GCs rather than ~6, and collects ~2MB rather than ~500kB. The throughput becomes a lot less consistent, however, at least partly due to some bigger GC pauses along the way. As Ben's statistics showed, our allocation rate on one thread is around 4GB/s, which is already stressing the GC out a bit, and parallelising it doesn't make that problem any easier.
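A minimal, self-contained sketch of that incantation (the `Message` type and `validate` function here are stand-ins for the real ones in the application; compile with `ghc -O2 -threaded` and run with `+RTS -N4`):

```haskell
import Control.Parallel.Strategies (parListChunk, rseq, using)

-- Stand-in for the real message type and validation logic.
type Message = String

validate :: Message -> Bool
validate = not . null  -- placeholder: real validation goes here

-- Validate messages in parallel, in chunks of 100, evaluating each
-- result to weak head normal form (rseq).
validateAll :: [Message] -> [Bool]
validateAll msgs = map validate msgs `using` parListChunk 100 rseq

main :: IO ()
main = print (length (filter id (validateAll (replicate 1000 "event"))))
```

The chunk size of 100 is just the figure from the incantation above; tuning it trades scheduling overhead against load balance.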
To elaborate on this a bit: in my experience, allocation rate becomes even more important for performance (both latency and throughput) when introducing parallelism. The reason is that our GC is a stop-the-world collector, so whenever one thread needs to GC, it must stop and wait for all the others to stop as well. This means that your program will be synchronizing constantly, even if your threads are working on completely non-intersecting sets of heap objects, which greatly limits your usable parallelism. You can observe this behavior using the eventlog.

It is possible to tune the GC to trade latency for throughput without changing your program: increasing the size of the nursery (e.g. with +RTS -A64m) means that each thread takes longer to fill its nursery and consequently needs to GC less frequently, but each minor GC will take longer. For this reason, if your program is truly embarrassingly parallel, it is usually preferable performance-wise to run multiple processes concurrently than to use pure parallelism (a sad truth in a harsh reality, in my opinion).

Note that there have been efforts to help with this issue in the past: Simon Marlow has a paper [1] examining a GC scheme with thread-local heaps, allowing threads to perform minor collections while other threads continue to run. Unfortunately, this introduced a significant amount of complexity into the RTS which the performance numbers just didn't justify.

The new compact regions feature also bears on this issue, as it gives users tools for better managing the effect of their long-lived live data on GC performance. That being said, this wouldn't help in your case, as you have little long-lived live data. Some form of the linear types proposal may also some day offer help here, as it could in principle allow alternatives to today's GC'd heap.
For instance, in your case you may want to use an arena allocation strategy: when starting a document, allocate an arena which will serve as your nursery, run your computation, copy the result out to the GC'd heap at the end, and throw away the arena.

Finally, I hope that someone will some day pick up the RTS work again. This might mean trying another approach to thread-local heaps, allowing multiple independent heaps per RTS, or even allowing multiple RTSs per process. It is sad that our parallelism story is limited in the ways that it is, and there are still a number of stones that remain unturned.

Cheers,

- Ben

[1] http://community.haskell.org/~simonmar/local-gc.pdf
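As a footnote to the compact regions feature Ben mentions: a minimal sketch of how it is used, assuming the ghc-compact package (GHC 8.2 or later); the lookup table here is a made-up example of long-lived live data. Once copied into a compact region, the data is treated by the GC as a single object and its interior is never traced, so it stops contributing to minor GC pause times.

```haskell
import GHC.Compact (compact, getCompact)

main :: IO ()
main = do
  -- Imagine this is a large, long-lived structure that would otherwise
  -- be traced on every major GC.
  let table = [(i, show i) | i <- [1 .. 100000 :: Int]]
  region <- compact table           -- copy the structure into a compact region
  let table' = getCompact region    -- read it back as an ordinary Haskell value
  print (lookup 50000 table')      -- prints Just "50000"
```

Note that only fully-evaluated, pointer-closed data can be compacted (no functions or thunks capturing outside references), which fits lookup tables and parsed documents well.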