
Hello, I recently implemented the algorithm used by GHC's compacting GC in another project. The algorithm (after marking) makes two passes over the heap /generation. In GHC, these passes are implemented in [1] and in the next function. In my implementation I tried 3 ways of implementing these passes, one of which is the same as GHC's code, and benchmarked each version. I found that the fastest implementation is not what's used in GHC, but it could easily be used. I should mention that my code targets Wasm, and I benchmarked Wasm instructions executed. Not CPU cycles, CPU instructions, or anything like that. It's very likely that the results will be different when benchmarking code running on actual hardware. Secondly, in my case the heap is mostly dead (residency is low). In GHC, compaction for the oldest generation is enabled when residency passes a threshold, so the assumption is the heap is mostly live. I'm guessing this should also make some difference. Anyway, the first implementation I tried was similar to GHC's scan, but I did scan += object_size(scan); instead of bumping scan by one, as GHC does in [2]. This was the slowest version. Second implementation did the same as GHC (bumped scan by one). This was faster, but still slower than the next version. What I found to be the best is scanning the bitmap, not the heap. The bitmap iterator reads one word at a time. In each iteration it checks if the bitmap word is 0. In GHC, in the best case this can skip 16 words on heap on 32-bit systems, and 32 words on 64-bit. Reminder: we need two bits per object in the bitmap, see [3]. (this is not the case in my implementation so the payoff is better) When the bitmap word is not 0 I use "count trailing zeros" (or "count leading zeros" depending on the bitmap implementation) to get the number of words to skip. This is a single instruction on Wasm and x86 (TZCNT or LZCNT, available via __builtin_ctzl and __builtin_clzl in gcc). So instead of skipping one word at a time, this can potentially skip 16 words (or 32 on 64-bit architectures). When that's not possible, it can still skip multiple words by using ctz/clz. Ömer [1]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts... [2]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts... [3]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts...