RTS/Garbage Collector idea

It's often bothered me that there is unneeded duplication in the heap when you do something the compiler doesn't manage to optimize, like

    f [a,b,c,d] = Foo bar [a,b,c,d]

or (reverse . reverse), or when () or Int(eger) or Bool boxes with the same value appear in multiple places in your program. A referentially transparent difference in your program could be the difference between constant memory usage and a memory leak (I think -- although this proposal/idea will not "tie the knot" for you, and in fact does not deal well with cyclic data structures at all).

I was thinking: since we have a copying garbage collector that reassigns all pointers anyway, could we detect "identical" heap objects and, for each set of identical ones, have all thunks that pointed to any of them now point to the single copy created in the target space? This might not have much effect (and seems likely to be slow, and possibly to have a complex interaction with parallel and/or incremental GC...). Maybe it should not be done on _every_ garbage collection. A first step might be just detecting the duplicates, so we have some idea how much effect such an "optimization" would even have. (Any idea how easy/possible that is, or what the answer is? ;)

Examples:

An evaluated (543::Int) somewhere ends up pointing to the same (543# :: Int#) in the heap as another (543::Int) somewhere else does.

    8:8:2:3:[]
    9:9:2:3:[]

The []s become shared, then the "3:[]"s, then the "2:3:[]"s, ending with

    8:8:x
    9:9:x
      where x is 2:3:[]

Similarly:

    8:8:2:3:(*)
    9:9:2:3:(*)
      where (*) is the *same* object, maybe an unevaluated thunk, maybe some list.

After (*)'s creation, two different parts of the code constructed their own lists with it as their tail. This eventually becomes:

    8:8:x
    9:9:x
      where x is 2:3:(*)

Two thunks that merely have the same code shouldn't be combined, because that could be the [1..] [1..] sharing memory leak taken to the extreme!
It would seem safe to combine thunks that only produce something small like an Int, but they can actually produce a complex Exception -- so, no combining of thunks. (Also, one must be careful in this area not to break IO monad sequencing (maybe?), but I think it's quite safe as long as only already-evaluated heap objects are combined.)

I wonder how this interacts with polymorphic data (list-nils of different types), or with data that is merely structurally equivalent. (If the objects are identical, merging shouldn't ever be a problem -- at least as long as no one relies on something like "if (unsafe pointer equality) then (same static type)"...)

Isaac
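Isaac's list examples can be sketched in Haskell (the bindings below are illustrative, not part of any proposed API); the explicit `let` in `shared` produces from the start exactly the layout that the proposed GC-time merging would recover:

```haskell
-- Two lists built independently: their common tail 2:3:[] is duplicated
-- in the heap, even though the two tails are equal values.
xs, ys :: [Int]
xs = 8 : 8 : 2 : 3 : []
ys = 9 : 9 : 2 : 3 : []

-- With an explicit let, the tail is a single heap object from the start:
-- this is the sharing the GC-time merging would produce automatically.
shared :: ([Int], [Int])
shared = let t = 2 : 3 : [] in (8 : 8 : t, 9 : 9 : t)
```

The two forms are indistinguishable to a pure program; only the heap layout differs.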

Isaac Dupree wrote:
I was thinking, since we have a copying garbage collector that reassigns all pointers anyway, could we detect the "identical" heap objects and for each set of identical ones, have all thunks that pointed to any of them, now point to the single one that is created in the target space?
I think what you're proposing is often called "hash consing", except that hash consing is usually done at construction time, whereas you want to do it at GC time.

My take is that it would only be worthwhile if there were a *lot* of sharing to be gained by doing it, and in most cases there wouldn't be. This is just a guess based on my experience poking around in the heap, though -- feel free to try it out and prove me wrong :-)

Cheers,
Simon
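Construction-time hash consing, as Simon describes it, can be sketched with a smart constructor that interns values in a global table. The names `internTable` and `hashCons` are hypothetical, not an existing API, and a real implementation would want weak references so the table itself doesn't leak:

```haskell
import Data.IORef
import qualified Data.Map.Strict as M
import System.IO.Unsafe (unsafePerformIO)

-- Global intern table (the standard, if inelegant, top-level-state idiom;
-- NOINLINE ensures there is exactly one table).
internTable :: IORef (M.Map String String)
internTable = unsafePerformIO (newIORef M.empty)
{-# NOINLINE internTable #-}

-- Return a canonical copy of the argument: equal strings constructed
-- through hashCons all share one heap object afterwards.
hashCons :: String -> IO String
hashCons s = atomicModifyIORef' internTable $ \m ->
  case M.lookup s m of
    Just canonical -> (m, canonical)       -- already interned: reuse it
    Nothing        -> (M.insert s s m, s)  -- first occurrence: intern s
```

The resulting sharing can be observed with `System.Mem.StableName`: two `hashCons` calls on equal strings hand back the same heap object, so their stable names compare equal.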

Simon Marlow wrote:
I think what you're proposing is often called "hash consing", except that hash-consing is usually done at construction time, you want to do it at GC time.
My take is it would only be worthwhile if there was a *lot* of sharing to be gained by doing it, and in most cases there wouldn't be.
Sounds likely. I wonder if there's a theoretical limit on the amount of memory "wasted" by not doing this (maybe some function of the amount of memory needed after such compaction?).
This is just a guess based on my experience poking around in the heap though - feel free to try it out and prove me wrong :-)
If I feel like it, which seems a bit unlikely -- it might be useful as a last-ditch garbage collection technique, but the process probably needs a nontrivial amount of memory itself, so maybe not :-)

Isaac

On Mon, Jun 18, 2007 at 02:00:02PM +0100, Simon Marlow wrote:
Isaac Dupree wrote:
I was thinking, since we have a copying garbage collector that reassigns all pointers anyway, could we detect the "identical" heap objects and for each set of identical ones, have all thunks that pointed to any of them, now point to the single one that is created in the target space?
I think what you're proposing is often called "hash consing", except that hash-consing is usually done at construction time, you want to do it at GC time.
My take is it would only be worthwhile if there was a *lot* of sharing to be gained by doing it, and in most cases there wouldn't be. This is just a guess based on my experience poking around in the heap though - feel free to try it out and prove me wrong :-)
It might be worthwhile to do in a couple of specific cases, like a cache of small Ints or the ASCII characters.

John
-- John Meacham - ⑆repetae.net⑆john⑈

On Thu, Jul 05, 2007 at 04:24:48PM -0700, John Meacham wrote:
On Mon, Jun 18, 2007 at 02:00:02PM +0100, Simon Marlow wrote:
Isaac Dupree wrote:
I was thinking, since we have a copying garbage collector that reassigns all pointers anyway, could we detect the "identical" heap objects and for each set of identical ones, have all thunks that pointed to any of them, now point to the single one that is created in the target space?
I think what you're proposing is often called "hash consing", except that hash-consing is usually done at construction time, you want to do it at GC time.
My take is it would only be worthwhile if there was a *lot* of sharing to be gained by doing it, and in most cases there wouldn't be. This is just a guess based on my experience poking around in the heap though - feel free to try it out and prove me wrong :-)
it might be worthwhile to do in a couple specific cases, like a cache of small Ints or the ascii characters.
Incidentally, GHC already does this, and has for so long that the optimization was described in section 7.3 of the original paper "Implementing lazy functional languages on stock hardware: The spineless tagless G-machine" (SPJ, 1992, JFP) (pages 48-49 on at least one copy). Stefan
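The trick Stefan refers to can be sketched at the Haskell level with a preallocated table of boxed values (GHC's RTS actually does this with statically allocated closures for small Ints and Chars; the range and the names `smallInts` and `mkInt` below are illustrative assumptions):

```haskell
import Data.Array

-- Hypothetical cache of preallocated boxed Ints: for values in range,
-- every mkInt call returns the same heap object rather than a fresh box.
smallInts :: Array Int Int
smallInts = listArray (-16, 16) [-16 .. 16]

mkInt :: Int -> Int
mkInt n
  | n >= -16 && n <= 16 = smallInts ! n  -- shared, cached box
  | otherwise           = n              -- fresh box
```

This gives exactly the "cache of small Ints" John suggested, paid for once at program start instead of at every GC.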

On Tuesday 19 June 2007 01:00, Simon Marlow wrote:
Isaac Dupree wrote:
I was thinking, since we have a copying garbage collector that reassigns all pointers anyway, could we detect the "identical" heap objects and for each set of identical ones, have all thunks that pointed to any of them, now point to the single one that is created in the target space?
I think what you're proposing is often called "hash consing", except that hash-consing is usually done at construction time, you want to do it at GC time.
My take is it would only be worthwhile if there was a *lot* of sharing to be gained by doing it, and in most cases there wouldn't be. This is just a guess based on my experience poking around in the heap though - feel free to try it out and prove me wrong :-)
Presumably it's possible to do piecemeal app-level hash consing of values by defining an appropriate identity memoisation. Though maybe the optimiser is so cunning it figures out you're returning an equal value and skips it.

It'd be nice if there were system-wide hash-consing memoisations that could be called when substantial sharing was anticipated. Maybe an easy approximation is keeping your String trie, or whatever, in a global IORef.

Or what about hash consing after an equality test? Less opportunity, but perhaps less cost.

Daniel
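Daniel's last idea, hash consing after an equality test, can be sketched as a tiny hypothetical helper: once two values have been compared and found equal anyway, hand back the same object for both, so the duplicate becomes garbage at the next collection:

```haskell
-- Hypothetical helper: after paying for the (==) anyway, keep one copy.
shareEq :: Eq a => a -> a -> (a, a)
shareEq x y
  | x == y    = (x, x)   -- drop y; both components now share one object
  | otherwise = (x, y)
```

Unlike a global memo table, this costs nothing beyond the comparison the program was already doing, but it only finds sharing where the program happens to test for equality.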

Hello Daniel,

It seems like a good proposal for a general-purpose library.

Friday, July 6, 2007, 4:12:49 AM, you wrote:
Presumably it's possible to do a piece-meal app-level hash consing of values by defining an appropriate identity memoisation. Though maybe the optimiser is so cunning it figures out you're returning an equal value and skips it.
It'd be nice if there were system wide hash consing memoisations that could be called when substantial sharing was anticipated. Maybe an easy approximation is keeping your String trie, or whatever, in a global IORef.
Or what about hash consing after an equality test? Less opportunity but perhaps less cost.
--
Best regards,
Bulat (mailto:Bulat.Ziganshin@gmail.com)
participants (6)
- Bulat Ziganshin
- Daniel McAllansmith
- Isaac Dupree
- John Meacham
- Simon Marlow
- Stefan O'Rear