Potential improvement in compacting GC

Hello,

I recently implemented the algorithm used by GHC's compacting GC in another project. The algorithm (after marking) makes two passes over the heap/generation. In GHC, these passes are implemented in [1] and in the next function.

In my implementation I tried three ways of implementing these passes, one of which is the same as GHC's code, and benchmarked each version. I found that the fastest implementation is not what's used in GHC, but it could easily be adopted.

I should mention that my code targets Wasm, and I benchmarked Wasm instructions executed, not CPU cycles, CPU instructions, or anything like that. It's very likely that the results will be different when benchmarking code running on actual hardware.

Secondly, in my case the heap is mostly dead (residency is low). In GHC, compaction for the oldest generation is enabled when residency passes a threshold, so the assumption is that the heap is mostly live. I'm guessing this should also make some difference.

Anyway, the first implementation I tried was similar to GHC's scan, but I did

    scan += object_size(scan);

instead of bumping scan by one, as GHC does in [2]. This was the slowest version.

The second implementation did the same as GHC (bumped scan by one). This was faster, but still slower than the next version.

What I found to be the best is scanning the bitmap, not the heap. The bitmap iterator reads one word at a time. In each iteration it checks whether the bitmap word is 0. In GHC, in the best case this can skip 16 words of heap on 32-bit systems, and 32 words on 64-bit. (Reminder: we need two bits per object in the bitmap, see [3]. This is not the case in my implementation, so the payoff there is better.)

When the bitmap word is not 0, I use "count trailing zeros" (or "count leading zeros", depending on the bitmap implementation) to get the number of words to skip. This is a single instruction on Wasm and x86 (TZCNT or LZCNT, available via __builtin_ctzl and __builtin_clzl in GCC).

So instead of skipping one word at a time, this can potentially skip 16 words (or 32 on 64-bit architectures). When that's not possible, it can still skip multiple words by using ctz/clz.

Ömer

[1]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts...
[2]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts...
[3]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts...
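To make the bitmap iteration concrete, here is a minimal sketch in C. It is not GHC's code: it assumes one mark bit per heap word (GHC's bitmap uses two bits per object, as noted above), the names `process_live_object` and `WORD_BITS` are hypothetical, bounds handling is elided, and `uintptr_t` is assumed to be `unsigned long` so that `__builtin_ctzl` applies.

    #include <stdint.h>
    #include <stddef.h>

    #define WORD_BITS (8 * sizeof(uintptr_t))

    /* Hypothetical callback invoked for each live object found. */
    extern void process_live_object(uintptr_t *obj);

    /* Skip whole bitmap words when they are zero, and use
     * count-trailing-zeros to jump straight to the next set bit
     * otherwise, instead of testing heap words one at a time. */
    static void scan_bitmap(const uintptr_t *bitmap, size_t bitmap_words,
                            uintptr_t *heap_start)
    {
        for (size_t i = 0; i < bitmap_words; i++) {
            uintptr_t bits = bitmap[i];
            if (bits == 0)
                continue;               /* skips WORD_BITS heap words at once */
            while (bits != 0) {
                size_t bit = (size_t)__builtin_ctzl(bits);
                process_live_object(heap_start + i * WORD_BITS + bit);
                bits &= bits - 1;       /* clear the lowest set bit */
            }
        }
    }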

Here are two other ideas that should improve GHC's compacting GC much more
significantly. I've implemented both of these in another project and the
results were great. In a GC benchmark (the mutator does almost no work other
than allocating data using a bump allocator), the first one reduced Wasm
instructions executed by 14%, the second one by 19.8%.
Both of these ideas require pushing object headers to the mark stack with the
objects, which means larger mark stacks. This is the only downside of these
algorithms.
- Instead of marking and then threading in the next pass, the mark phase
threads all fields when pushing them to the mark stack. We still need two
other passes: one to unthread headers, another to move the objects. (We can't
do both in one pass; let me know if you're curious and I can elaborate.)
This has the same number of passes as the current implementation, but it only
visits object fields once. Currently, we visit fields once when marking, to
mark them, then again in `update_fwd`. This implementation does both in one
pass over the fields, and `update_fwd` no longer visits fields.
This reduced Wasm instructions executed by 14% in my benchmark.
- The marking phase threads backwards pointers (and ignores forwards
pointers). Then we do one pass instead of two: for a marked object, unthread
it (updating forwards pointers to the object's new location), move it to its
new location, then thread its forwards pointers. This completely eliminates
one of the three passes, but fields need to be visited twice, as before
(unlike the algorithm above). A sketch of the threading primitive and of
this single pass follows the list below.
I think this one is originally described in "An Efficient Garbage Compaction
Algorithm", but I found that paper difficult to follow. A short description
of the same algorithm exists in "High-Performance Garbage Collection for
Memory-Constrained Environments", section 5.1.2.
This reduced Wasm instructions executed by 19.8% in my benchmark.
In this algorithm, fields that won't be moved can be threaded at any time
before the second pass (the pointed-to objects need to be marked and pushed
to the mark stack with their headers before threading a field). For example,
in GHC, mut list entries can be threaded before or after marking (but before
the second pass), as IIRC mut lists are not moved. The same goes for fields
of large objects.
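To make the above concrete, here is a minimal sketch in C of a Jonkers-style threading primitive and of the single compaction pass in the second algorithm. It is a sketch of the technique, not GHC's implementation: `word_t`, `is_marked`, `object_size`, `num_pointer_fields`, `pointer_field` and the tag-bit scheme are all hypothetical, and blocks, info tables, and root handling are elided.

    #include <stdint.h>
    #include <string.h>

    typedef uintptr_t word_t;

    /* Hypothetical helpers; in GHC these would consult the mark bitmap
     * and the info table. Sizes are in words. */
    extern int     is_marked(const word_t *obj);
    extern size_t  object_size(const word_t *obj);
    extern size_t  num_pointer_fields(const word_t *obj);
    extern word_t *pointer_field(word_t *obj, size_t i);

    /* Headers and fields are word-aligned, so the low bit is free to tag
     * threaded words (a simplifying assumption of this sketch). */
    #define THREAD_TAG ((word_t)1)

    /* Thread a field: the field temporarily stores the old header word,
     * and the object's header becomes the head of a linked list running
     * through all fields that refer to the object. */
    static void thread(word_t *field)
    {
        word_t *obj = (word_t *)*field;
        *field = obj[0];
        obj[0] = (word_t)field | THREAD_TAG;
    }

    /* Unthread: walk the chain rooted at the header, pointing every
     * referring field at `new_addr`, and restore the original header. */
    static void unthread(word_t *obj, word_t new_addr)
    {
        word_t w = obj[0];
        while (w & THREAD_TAG) {
            word_t *field = (word_t *)(w & ~THREAD_TAG);
            w = *field;
            *field = new_addr;
        }
        obj[0] = w;
    }

    /* Single compaction pass. Marking has already threaded all backwards
     * pointers, so when we reach an object its chain holds exactly the
     * references from locations already scanned. The destination is known
     * before unthreading here only because of the simple bump scheme;
     * GHC additionally has to handle moving across block boundaries,
     * which this sketch ignores. */
    static void compact_pass(word_t *heap, word_t *heap_end)
    {
        word_t *dest = heap;
        word_t *p = heap;
        while (p < heap_end) {
            if (!is_marked(p)) { p++; continue; } /* or skip via the bitmap */
            unthread(p, (word_t)dest);      /* also restores the real header */
            size_t size = object_size(p);
            memmove(dest, p, size * sizeof(word_t));
            /* Thread the moved object's fields that point forwards, to
             * objects not yet reached; backwards references are final. */
            for (size_t i = 0; i < num_pointer_fields(dest); i++) {
                word_t *field = pointer_field(dest, i);
                if ((word_t *)*field > p)
                    thread(field);
            }
            p += size;
            dest += size;
        }
    }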
As far as I can see, mark-compact GC is still the default when a max heap
size is specified and the oldest generation size is (by default) more than
30% of the max heap size. I'm not sure if a max heap size is specified often
(it's off by default), so I'm not sure what the impact of these improvements
would be, but if anyone would be interested in funding me to implement these
ideas (the second algorithm above, and the bitmap iteration in the previous
email) I could try to allocate one or two days a week and finish in a few
months.
Normally these are simple changes, but it's difficult to test and debug GHC's
RTS, as we don't have a test suite readily available and the code is not
easily testable. In my previous implementations of these algorithms I had
unit tests for the GC where I could easily generate arbitrary graphs (with
cycles, backwards ptrs, forwards ptrs, ptrs from/to roots, etc.) and test the
GC in isolation. Implementation of (2) took less than a day, and I didn't
have to debug it further once the tests passed. It's really unfortunate that
GHC's RTS makes this kind of thing difficult.
Ömer

Thanks Omer
I recently had an interesting conversation with Steve Blackburn, the brains
behind the MMTk memory management toolkit:
https://www.mmtk.io/
MMTk is designed to be a re-usable, open-source garbage collector, specifically
designed to be usable with lots of languages. In principle this is a great
idea: GC is such a big field that no runtime (GHC's included) can ever devote
enough effort to GC to do a really state of the art job. It makes sense for
one bunch of people to stellar GC and another bunch to simply reuse their
work.
Of course, the interface between the GC and the mutator, scheduler, etc.
is particularly intimate. Teasing them apart in GHC would be a significant
task, and success would not be guaranteed.
But Steve is interested in working on this, with help from our end, perhaps
initially with a student (or volunteer) project or two.
If it worked, it'd be cool.
Here's a talk about MMTk: https://www.youtube.com/watch?v=3L6XEVaYAmU
Simon

Greetings!

I'd like to take this opportunity to ask you experts about the feasibility / technology-readiness of distributed GC. I've recently been pondering the idea of a distributed GC managing a shared heap across multiple server nodes, inter-connected through fast Ethernet.

I'm implementing an array database system for internal use. Each array is constrained to a flat address space, so no GC is required within one. But the number of arrays is unexpectedly large, as my use cases are becoming more apparent. Managing the relations between those arrays (i.e. the metadata) has turned out to be far beyond the computational capacity of a single physical PC server (even powerful ones in today's market). We are working around this limitation by restricting the metadata to be mappable to the directory structure of the underlying filesystem, but that's quite limiting from a business perspective.

We use a FUSE filesystem driver to handle fine-grained data-coherence control over these many small arrays, accessed by many concurrent client machines, by exposing a large virtual file for mmap on each client, so as to efficiently leverage OS kernel pages as shared cache storage backing the multiple processes running on each client OS.

I'd imagine a shared heap built with https://hackage.haskell.org/package/compact plus a GC to manage the structural metadata likewise, so that each client can mmap the entire shared heap but only load the relevant kernel pages on demand as it explores the information graph. This way the native programming language assumes the role of the Data Manipulation Language as well, and the Query Language can simply be some optimized data structures wrapped in accessor library functions. More importantly, the database system would gain great horizontal scalability. And the whole architecture would carry little technical debt, as far as I can see so far.

So how feasible does this look to you? I'm aware that [compact] is still somewhat experimental, and not even portable across GHC compilations yet, but that seems solvable with enough effort put in.

Thanks,
Compl

Here's another improvement that fixes a very old issue in GHC's compacting GC
[1].
To summarize the problem: when unthreading an object we update the
references to the object that we've seen so far to point to the object's new
location. But to get the object's new location we need to know the object's
size, because depending on the size we may need to move the object to a new
block (when the current block does not have enough space for the object).
For this we currently walk the thread twice: once to get the info table (used
to get the size), then again to update the references to the object. Ideally
we want to do just one pass when unthreading.
The solution is explained in The GC Handbook, section 3.4. Instead of using
one bit to mark an object, we use two bits: one for the first word of the
object, and one for the last word. Using two bits is not a problem in GHC
because heap objects are at least two words. For example, a two-word object
is marked with `11`, a three-word object with `101`, and so on.
Now we can scan the bitmap to find the object's size, and unthread it without
having to find the info table first.
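As a rough illustration (not GHC's actual code), here is a sketch of reading an object's size from such a bitmap. It assumes the bitmap is an array of words with one bit per heap word, `first` is the bit index of the object's first word, `uintptr_t` is `unsigned long`, and bounds checking is elided.

    #include <stdint.h>
    #include <stddef.h>

    #define WORD_BITS (8 * sizeof(uintptr_t))

    /* With the two-bit scheme, bit `first` marks the object's first word
     * and the next set bit marks its last word, so the distance between
     * the two bits gives the size without consulting the info table. */
    static size_t object_size_from_bitmap(const uintptr_t *bitmap,
                                          size_t first)
    {
        size_t i   = (first + 1) / WORD_BITS;
        size_t bit = (first + 1) % WORD_BITS;
        /* Keep only bits at or above position first + 1. */
        uintptr_t w = bitmap[i] & (~(uintptr_t)0 << bit);
        while (w == 0)
            w = bitmap[++i];            /* skip all-zero bitmap words */
        size_t last = i * WORD_BITS + (size_t)__builtin_ctzl(w);
        return last - first + 1;        /* object size in heap words */
    }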
Ömer
[1]: https://github.com/ghc/ghc/blob/922c6bc8dd8d089cfe4b90ec2120cb48959ba2b5/rts...

Hey Ömer,
Just in case you are wondering whether you are talking into the void: you aren't! Keep the good suggestions coming, someone will eventually be able to get around to implementing them!
Thanks for your write-ups.
Cheers,
Sebastian

Thanks Ömer! I have plopped all of these ideas into a ticket, #20328.
Hopefully someone will come along to implement them.
Cheers,
- Ben