
On Wed, Jul 14, 2010 at 10:35:50AM +0100, Simon Marlow wrote:
Yeah, I didn't realize how important the allocator was until I started benchmarking, spending time cutting the cost of marking garbage in half didn't help nearly as much as shaving a few cycles off the allocator. The fast pass of the allocator is actually very fast, each object type has its own allocator and free list so allocation is pretty much just pulling an object off of the free list, it is already of the appropriate size and its constant fields are pre-initialized as they arn't cleared during garbage collection. (there is a heuristic that claims full pages back to the general pool sometimes).
The slow path has to grab a full page and initialize it, but that isn't really much slower as I can prefetch the cache lines needed so the cost is on the order of another cache line fill. (thinking about computational complexity in terms of O(cache line fills) rather than O(operations) is much more appropriate on todays architectures.).
Right, you can see how important locality is by looking at these graphs that Don produced recently:
http://haskell.org/haskellwiki/Ghc-gc-tune
generational GC these days is important more for locality than for the benefits of avoiding repeated tracing.
Yeah, jhc's GC is currently not generational, it is certainly something I want to do, something I have considered as a quick hack was splitting up allocations based on whether they were updatable or not. allocating things that can statically be determined to be in HNF into an "older" generation and allocating updatable thunks in a younger one. The theory being updatable thunks are more likely to be overwritten by a indirection and become garbage soon, I was thinking this could give me some of the benefits of a generational collector without having to rewrite my GC around being able to promote objects to an older generation. I am not sure if it will work, but it is an idea I have been toying with. Implementing better heap profiling support for jhc generated code is a prerequisite in any case.
Speaking of prefetching, we get a lot of benefit in GHC from the automatic prefetching done by modern CPUs; I'm not sure how this would be affected by having multiple allocation regions. Manual prefetching is almost impossible to get right in my experience, see also Nethercote/Mycroft where they did some prefetching experiments with GHC:
Ah, this paper looks very interesting, I was wondering if you had experimented with prefetching just ahead of the allocation pointer. Looks like it helped :) John -- John Meacham - ⑆repetae.net⑆john⑈ - http://notanumber.net/