Garbage collecting pointers

Hi
For some time I have been thinking about an idea which could limit
Haskell's memory footprint. I don't know if the idea is crazy or clever,
but I would love to hear people's thoughts about it. The short story is,
I propose that the garbage collector should not just reclaim unused
memory, it should also diminish the need for pointers by replacing
nested data structures with larger chunks of consecutive memory. In
other words, I would diminish the need for pointers for arbitrary
recursive data types, just as replacing linked lists with arrays
eliminates the need for pointers.
I will explain my idea by an example of a data type we all know and
love:
data List a = Cons a (List a) | Nil
Each Cons cell uses two pointers: one for the element and one for the
rest of the list. If we could somehow merge the element and the rest of
the list into consecutive memory, we would be able to eliminate the
pointer to the rest of the list. On 64-bit architectures merging would
save us 8 bytes of "wasted" memory. If we could merge n elements
together, we could save n*8 bytes of memory.
64-bit pointers are wasteful in another way, as nobody has anywhere near
2^64 bytes of memory, and nobody is going to have that much memory for
decades to come. Thus, we could use the 8 most significant bits for
something besides pointing to memory, without losing any significant
ability to address our memory. This would be similar to how GHC uses
some of the least significant bits for pointer tagging.
I suggest using the 8 most significant bits for what could be termed the
length of the pointer. A length of zero would mean that the pointer
points to a Cons cell just like we have them today. A length of one
would mean that one extra element is embedded into the Cons cell. That
is, we would have this block of consecutive memory:

    <Cons marker> <element>
    <Cons marker> <element> <pointer to rest of list>

We could also call this a chunk of length two. A pointer of length two
would mean two embedded list elements (a chunk of length three). And so on...

By embedding the "length" of the list in the pointer, it becomes possible
that one pointer could point to the beginning of a chunk of n consecutive
elements, while another one could point to the middle of the same chunk
(but the pointer would be of a different length). This would enable
sharing without having to break the chunk up into smaller ones.

How should chunks be created? Or, if a function creates a lot of
one-element chunks, how are they turned into longer chunks? I imagine
that it would be the job of the garbage collector to merge the different
elements of a list into larger chunks. Of course some code, like reading
a file, could create chunks directly.

I have used lists as an example, but I imagine that GHC could do this for
any recursively defined data type.

I can see a few, but big, disadvantages as well. Mainly, my scheme would
complicate dereferencing a pointer, which makes the compiled code larger
and might make branch prediction harder. The latter would only be a
problem on architectures that are pipelined, but popular processors from
Intel and AMD are both heavily pipelined.
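As a minimal sketch of the chunking idea, assuming invented names (ChunkedList, toPlain, chunkOf), this models the proposal in source Haskell; it says nothing about GHC's actual heap representation:

    data ChunkedList a
      = CNil
      | CCons a (ChunkedList a)      -- today's cell: one element, one pointer to the rest
      | Chunk [a] (ChunkedList a)    -- a run of elements stored consecutively, then one pointer

    -- Flatten back to an ordinary list; only the layout differs, not the meaning.
    toPlain :: ChunkedList a -> [a]
    toPlain CNil          = []
    toPlain (CCons x xs)  = x : toPlain xs
    toPlain (Chunk xs ys) = xs ++ toPlain ys

    -- What a collector might do: gather up to n spine cells into one chunk.
    chunkOf :: Int -> [a] -> ChunkedList a
    chunkOf n xs
      | n <= 1    = foldr CCons CNil xs      -- nothing to merge
      | null xs   = CNil
      | otherwise = let (front, rest) = splitAt n xs
                    in Chunk front (chunkOf n rest)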

On Mar 26, 2010, at 16:28, Mads Lindstrøm wrote:
For some time I have been thinking about an idea, which could limit Haskell's memory footprint. I don't know if the idea is crazy or clever,
This is called pointer tagging. The original STG design avoided it because of the perceived performance loss in removing the tags before using the pointer, but as I understand it the current GHC design uses limited tagging.
-- 
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

On Fri, Mar 26, 2010 at 9:21 PM, Brandon S. Allbery KF8NH <allbery@ece.cmu.edu> wrote:
On Mar 26, 2010, at 16:28, Mads Lindstrøm wrote:
For some time I have been thinking about an idea, which could limit Haskell's memory footprint. I don't know if the idea is crazy or clever,
This is called pointer tagging. The original STG design avoided it because of the perceived performance loss in removing the tags before using the pointer, but as I understand it the current GHC design uses limited tagging.
I don't think that's what he's talking about. He's saying the data that would normally be on the other side of the pointer would instead be stored "inline", if I understand him correctly.
-- Sebastian Sylvan

Hi
On Fri, 2010-03-26 at 21:24 +0000, Sebastian Sylvan wrote:
On Fri, Mar 26, 2010 at 9:21 PM, Brandon S. Allbery KF8NH wrote:
On Mar 26, 2010, at 16:28, Mads Lindstrøm wrote:
For some time I have been thinking about an idea, which could limit Haskell's memory footprint. I don't know if the idea is crazy or clever,
This is called pointer tagging. The original STG design avoided it because of the perceived performance loss in removing the tags before using the pointer, but as I understand it the current GHC design uses limited tagging.
I don't think that's what he's talking about. He's saying the data that would normally be on the other side of the pointer would instead be stored "inline", if I understand him correctly.
Yes, that was what I tried to convey. Thanks for clarifying for me.
/Mads

On Fri, Mar 26, 2010 at 8:28 PM, Mads Lindstrøm wrote:
Hi
For some time I have been thinking about an idea which could limit Haskell's memory footprint. I don't know if the idea is crazy or clever, but I would love to hear people's thoughts about it. The short story is, I propose that the garbage collector should not just reclaim unused memory, it should also diminish the need for pointers by replacing nested data structures with larger chunks of consecutive memory. In other words, I would diminish the need for pointers for arbitrary recursive data types, just as replacing linked lists with arrays eliminates the need for pointers.
I will explain my idea by an example of a data type we all know and love:
data List a = Cons a (List a) | Nil
Each Cons cell uses two pointers: one for the element and one for the rest of the list. If we could somehow merge the element and the rest of the list into consecutive memory, we would be able to eliminate the pointer to the rest of the list. On 64-bit architectures merging would save us 8 bytes of "wasted" memory. If we could merge n elements together, we could save n*8 bytes of memory.
64-bit pointers are wasteful in another way, as nobody has anywhere near 2^64 bytes of memory, and nobody is going to have that much memory for decades to come. Thus, we could use the 8 most significant bits for something besides pointing to memory, without losing any significant ability to address our memory. This would be similar to how GHC uses some of the least significant bits for pointer tagging.
I suggest using the 8 most significant bits for what could be termed the length of the pointer. A length of zero would mean that the pointer points to a Cons cell just like we have them today. A length of one would mean that one extra element is embedded into the Cons cell. That is, we would have this block of consecutive memory:
<Cons marker> <element>
<Cons marker> <element> <pointer to rest of list>
We could also call this a chunk of length two. A pointer of length two would mean two embedded list elements (a chunk of length three). And so on...
By embedding the "length" of the list in the pointer, it becomes possible that one pointer could point to the beginning of a chunk of n consecutive elements, while another one could point to the middle of the same chunk (but the pointer would be of a different length). This would enable sharing without having to break the chunk up into smaller ones.
How should chunks be created? Or, if a function creates a lot of one-element chunks, how are they turned into longer chunks? I imagine that it would be the job of the garbage collector to merge the different elements of a list into larger chunks. Of course some code, like reading a file, could create chunks directly.
I have used lists as an example, but I imagine that GHC could do this for any recursively defined data type.
I can see a few, but big, disadvantages as well. Mainly, my scheme would complicate dereferencing a pointer, which makes the compiled code larger and might make branch prediction harder. The latter would only be a problem on architectures that are pipelined, but popular processors from Intel and AMD are both heavily pipelined.
Reorganizing data on the fly sounds like it may be a pretty sensible idea now that cache misses are so bad (in comparison). The fact that Haskell data is generally immutable helps too. However, I think your scheme sounds a bit complicated for my tastes, and I don't really understand the idea of "lengths": what would "n consecutive elements" mean for a tree? Surely this wouldn't work just for list-like data structures?
Something that seems a bit easier to work with would be to use just two data layouts: one which uses pointers everywhere like now, and one in which every single field is expanded one level (each of them could in turn be expanded one more level, but you don't know that until you look at it). If one of the fields can't be expanded (e.g. because it would become cyclic) then you just have to keep the "fully pointerfied" version.
It's still a bit tricky because the offset to a specific member would depend on the specific sizes of the previous fields in the layout (which can vary depending on the actual value). It would always be possible to figure it out at runtime, though, by scanning through the fields each time, finding the next one by incrementing the size, and you could just disable this feature for large data types. You could potentially store a heavily quantized lookup table (possibly at the expense of padding some members to align all members on a large common multiple from the base).
The general idea seems interesting to me, but then I'm hardly an expert. I like the notion of using the fact that the language is free to change the data representation at runtime to improve performance by "collapsing" an immutable (recursive) data structure once it's been created. Partly because it's something that would be harder to do in other languages!
-- Sebastian Sylvan
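A small Haskell model of the layout scan described above, assuming invented names (Field, fieldSize, offsetOf) and illustrative sizes rather than GHC's real object layout:

    data Field = Pointer        -- kept behind a pointer: fixed 8 bytes, as today
               | Expanded Int   -- expanded in place: size depends on the value

    fieldSize :: Field -> Int
    fieldSize Pointer      = 8
    fieldSize (Expanded n) = n

    -- Offset of field i within an object, found by scanning the earlier fields
    -- and summing their (value-dependent) sizes.
    offsetOf :: [Field] -> Int -> Int
    offsetOf fields i = sum (map fieldSize (take i fields))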

Hi
On Fri, 2010-03-26 at 21:33 +0000, Sebastian Sylvan wrote:
Reorganizing data on the fly sounds like it may be a pretty sensible idea now that cache misses are so bad (in comparison). The fact that Haskell data is generally immutable helps too. However, I think your scheme sounds a bit complicated for my tastes, and I don't really understand the idea of "lengths", what would "n consecutive elements" mean for a tree? Surely this wouldn't work just for list-like data structures?
First of all, I had assumed that a data type's memory footprint was independent of its value, similarly to how a C union always occupies the same amount of memory. After reading your post I reevaluated my assumption, and my assumption might just be wrong. However, if my assumption is right, we could lay out the memory in a preorder fashion. The following tree:
      a
     / \
    /   \
   b     c
  / \   / \
 d   e f   g
would have the following memory layout: a, b, d, e, c, f, g. A pointer to element "a" would be of length 6 (a + 6 additional elements). The preorder layout would also allow us to point to part of the structure. For example, we could make a pointer to element "b" of length 2 (b + 2 additional elements). In this way, we do not need to break a larger chunk into smaller ones to enable sharing of data. Neither would we need to scan through the whole structure when only pointing to a part of it.
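A short Haskell illustration of that preorder flattening (Tree, preorder and example are names invented for the sketch):

    data Tree a = Leaf | Node a (Tree a) (Tree a)

    -- Lay the nodes out in preorder: a node first, then its entire left
    -- subtree, then its entire right subtree.
    preorder :: Tree a -> [a]
    preorder Leaf         = []
    preorder (Node x l r) = x : preorder l ++ preorder r

    -- The tree from the message above; preorder example == "abdecfg",
    -- i.e. the layout a, b, d, e, c, f, g.
    example :: Tree Char
    example = Node 'a'
                (Node 'b' (Node 'd' Leaf Leaf) (Node 'e' Leaf Leaf))
                (Node 'c' (Node 'f' Leaf Leaf) (Node 'g' Leaf Leaf))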
Something that seems a bit easier to work with would be to use just two data layouts, one which uses pointers everywhere like now, and one in which every single field is expanded one level (each of them could in turn be expanded one more level, but you don't know that until you look at it). If one of the fields can't be expanded (e.g. because it would become cyclic) then you just have to keep the "fully pointerfied" version.
It's still a bit tricky because the offset to a specific member would be depend on the specific sizes of the previous fields in the layout (which can vary depending on the actual value). It would always possible to figure it out at runtime, though, by scanning through the fields each time finding the next one by incrementing the size, and you could just disable this feature for large data types. You could potentially store a heavily quantized lookup table (possibly at the expense of padding some members to align all members on a large common multiple from the base).
Would scanning not run the risk of changing the asymptotic complexity of programs? What does "heavily quantized lookup table" mean? It is the "heavily quantized" part I do not get.
The general idea seems interesting to me, but then I'm hardly an expert. I like the notion of using the fact that the language is free to change the data representation at runtime to improve performance by "collapsing" an immutable (recursive) data structure once it's been created. Partly because it's something that would be harder to do in other languages!
-- Sebastian Sylvan
/Mads

On Fri, Mar 26, 2010 at 10:52 PM, Mads Lindstrøm wrote:
Hi
On Fri, 2010-03-26 at 21:33 +0000, Sebastian Sylvan wrote:
Reorganizing data on the fly sounds like it may be a pretty sensible idea now that cache misses are so bad (in comparison). The fact that Haskell data is generally immutable helps too. However, I think your scheme sounds a bit complicated for my tastes, and I don't really understand the idea of "lengths", what would "n consecutive elements" mean for a tree? Surely this wouldn't work just for list-like data structures?
First of all, I had assumed that a data type's memory footprint was independent of its value. Similarly to how a C-union always occupies the same amount of memory. After reading your post, I reevaluated my assumption and my assumption might just be wrong.
Take the Maybe data type. The Nothing value occupies no space other than the Nothing tag; the Just part occupies more (potentially a lot more, if it too has been collapsed!). It's conceivable that you would force the size to be the same à la C unions, and have some thresholding where, if one of the possible alternatives is too big, it would always use a pointer for that one alternative, to minimize padding.
What do "heavily quantized lookup table" mean? It is the "heavily quantized" part I do not get.
E.g. store a small number of bits of offset information per field in a record at the top, let's say 8 bits per field. Now, that means you can have 256 unique values, and therefore fields, but nothing says that the offset value needs to refer to "bytes"; it could just as easily refer to "multiples of 16 bytes" for larger structures. This means you would need to align all your fields to 16-byte boundaries, which may need padding, but it lets you refer to fields with a larger offset without using a full pointer per field.
Maybe your gain is wasted by doing too much logic here. Perhaps a more conservative suggestion would be to keep the pointers, so the runtime code is the same, but store a little tag indicating that a value has an optional memory extent for the purposes of GC. This way you could compact the leaves of a data graph (still keeping the pointers), which lets the GC stop following pointers once it hits the tag and reads the memory extent saying "this is some data totalling 612 bytes, and I promise that any pointers in this memory block are all internal to the block". You'd reduce GC work, and more importantly cache misses, but not overall memory usage.
-- Sebastian Sylvan
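A toy Haskell version of the quantized offset table, assuming invented names (quantum, encodeOffset, decodeOffset) and the 16-byte unit used in the example above:

    import Data.Word (Word8)

    quantum :: Int
    quantum = 16

    -- One byte per field, measured in 16-byte units, so offsets of up to
    -- 255 * 16 = 4080 bytes fit without a full pointer per field.
    encodeOffset :: Int -> Maybe Word8
    encodeOffset byteOff
      | byteOff `mod` quantum /= 0  = Nothing   -- field must be 16-byte aligned
      | byteOff `div` quantum > 255 = Nothing   -- out of range for one byte
      | otherwise                   = Just (fromIntegral (byteOff `div` quantum))

    decodeOffset :: Word8 -> Int
    decodeOffset w = fromIntegral w * quantum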

I have to point out that any such scheme as is being described would
need to be done quite carefully so as not to break the pass-by-reference
data semantics that Haskell enjoys / the wealth of sharing.
Moreover, as I understand it, something like this is only feasible in
general for statically sized data structures (though there was research
on SML/NJ in the 90s, which I think was detailed in the PhD thesis of
Zhong Shao, evaluating an approach to chunking the cons cells of lists
into contiguous array-like chunks), and in general any approach to this
sort of cleverness would require a lot more dictionary passing than
Haskell already deals with.
Additionally, I'm fairly certain both that any non-heuristic approach
to such optimization would quickly hit some computational
intractability à la knapsack, and that it would play merry hell with
satisfying the memory alignment constraints required by some ABIs and
encouraged by some processors.

On 26/03/2010 20:28, Mads Lindstrøm wrote:
Hi
For some time I have been thinking about an idea which could limit Haskell's memory footprint. I don't know if the idea is crazy or clever, but I would love to hear people's thoughts about it. The short story is, I propose that the garbage collector should not just reclaim unused memory, it should also diminish the need for pointers by replacing nested data structures with larger chunks of consecutive memory. In other words, I would diminish the need for pointers for arbitrary recursive data types, just as replacing linked lists with arrays eliminates the need for pointers.
I will explain my idea by an example of a data type we all know and love:
data List a = Cons a (List a) | Nil
Each Cons cell uses two pointers: one for the element and one for the rest of the list. If we could somehow merge the element and the rest of the list into consecutive memory, we would be able to eliminate the pointer to the rest of the list. On 64-bit architectures merging would save us 8 bytes of "wasted" memory. If we could merge n elements together, we could save n*8 bytes of memory.
The trouble with techniques like this is that they break the uniformity of the representation, and complexity leaks all over the place. Various similar ideas have been tried in the past, though not with Haskell as far as I'm aware: CDR-coding and BiBOP spring to mind.
64-bit pointers are wasteful in another way, as nobody has anywhere near 2^64 bytes of memory, and nobody is going to have that much memory for decades to come. Thus, we could use the 8 most significant bits for something besides pointing to memory, without losing any significant ability to address our memory. This would be similar to how GHC uses some of the least significant bits for pointer tagging.
Unfortunately you don't know whereabouts in your address space your memory is located: the OS could do randomised allocation and give you pages all over the place, so in fact you might need all 64 bits. Yes, you can start doing OS-specific things, but that always leads to pain later (I know, I've been there).
In the Java community there has been experimentation with using different representations to avoid the 64-bit pointer overhead: e.g. shifting a 32-bit pointer to the left by 2 bits in order to access 16GB of memory. Personally I'm not keen on doing this kind of trick, mainly because in GHC it would be a compile-time switch; Java has it easy here because they can make the choice at runtime. Also, we already use the low 2 bits of pointers in GHC for pointer tagging.
Cheers,
Simon
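A rough Haskell sketch of the compressed-pointer arithmetic Simon refers to, assuming invented names (heapBase, compress, decompress), a single contiguous heap and 4-byte alignment; shifting a 32-bit offset left by 2 bits covers 2^34 bytes = 16 GB:

    import Data.Bits (shiftL, shiftR)
    import Data.Word (Word32, Word64)

    heapBase :: Word64
    heapBase = 0x100000000          -- assumed start of the heap (illustrative)

    -- Store a full address as a 32-bit offset counted in 4-byte units.
    compress :: Word64 -> Word32
    compress addr = fromIntegral ((addr - heapBase) `shiftR` 2)

    -- Recover the full address: base + offset * 4.
    decompress :: Word32 -> Word64
    decompress w = heapBase + (fromIntegral w `shiftL` 2)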

On Mon, Mar 29, 2010 at 12:00 PM, Simon Marlow wrote:
On 26/03/2010 20:28, Mads Lindstrøm wrote:
Hi
For some time I have been thinking about an idea which could limit Haskell's memory footprint. I don't know if the idea is crazy or clever, but I would love to hear people's thoughts about it. The short story is, I propose that the garbage collector should not just reclaim unused memory, it should also diminish the need for pointers by replacing nested data structures with larger chunks of consecutive memory. In other words, I would diminish the need for pointers for arbitrary recursive data types, just as replacing linked lists with arrays eliminates the need for pointers.
I will explain my idea by an example of a data type we all know and love:
data List a = Cons a (List a) | Nil
Each Cons cell uses two pointers: one for the element and one for the rest of the list. If we could somehow merge the element and the rest of the list into consecutive memory, we would be able to eliminate the pointer to the rest of the list. On 64-bit architectures merging would save us 8 bytes of "wasted" memory. If we could merge n elements together, we could save n*8 bytes of memory.
The trouble with techniques like this is that they break the uniformity of the representation, and complexity leaks all over the place. Various similar ideas have been tried in the past, though not with Haskell as far as I'm aware: CDR-coding and BiBOP spring to mind.
While CDR-coding can introduce one hell of a complexity explosion if you're not careful, using an (optional) BiBOP-like representation for the tag isn't that bad. I've been experimenting with it in a lightweight Haskell-like language. You associate an optional tag with each page, and when looking up a tag you check the page level first. That way you can use a bump allocator initially, and as tags benchmark as being more common during garbage collection, you can dedicate pages to them. With GHC-style pointer tagging you rarely look at the actual tag anyway, so this extra cost is only incurred when forcing the thunk initially, or when the tag can't be applied (too many constructors).
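A small Haskell sketch of such a page-level tag table, assuming invented names (pageSize, PageTags, tagOf) and a Data.Map standing in for what a real runtime would keep in its block descriptors:

    import qualified Data.Map as M
    import Data.Word (Word64)

    pageSize :: Word64
    pageSize = 4096

    -- Pages dedicated to a single constructor record that constructor's tag.
    type PageTags = M.Map Word64 Int      -- page number -> constructor tag

    -- Look up a tag: try the page table first, otherwise read the object's own header.
    tagOf :: PageTags -> (Word64 -> Int) -> Word64 -> Int
    tagOf pages readHeaderTag addr =
      case M.lookup (addr `div` pageSize) pages of
        Just t  -> t
        Nothing -> readHeaderTag addr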
64-bit pointers are wasteful in another way, as nobody has anywhere near 2^64 bytes of memory, and nobody is going to have that much memory for decades to come. Thus, we could use the 8 most significant bits for something besides pointing to memory, without losing any significant ability to address our memory. This would be similar to how GHC uses some of the least significant bits for pointer tagging.
Unfortunately you don't know whereabouts in your address space your memory is located: the OS could do randomised allocation and give you pages all over the place, so in fact you might need all 64 bits. Yes, you can start doing OS-specific things, but that always leads to pain later (I know, I've been there).
In the Java community there has been experimentation with using different representations to avoid the 64-bit pointer overhead: e.g. shifting a 32-bit pointer to the left by 2 bits in order to access 16GB of memory. Personally I'm not keen on doing this kind of trick, mainly because in GHC it would be a compile-time switch; Java has it easy here because they can make the choice at runtime. Also we already use the low 2 bits of pointers in GHC for pointer tagging.
The closest I've been able to get to making this scheme work is to use the MSB of the pointer as an 'evaluated' bit, using 16-byte-aligned 'CompressedOOPs', and checking evaluation with something like:
    add eax, eax  -- set carry based on evaluation flag, and double the base pointer
    jc evaluated  -- otherwise fetch values using base+[eax*8]+slot
Of course, knowing that something is evaluated is far less useful than knowing its actual tag, but the language in question lacks some of the niceties of Haskell with regards to optimization potential here. Though it does give access to a nice 32 gig heap at an arbitrary base address with 16-byte alignment, which makes it suitable to use SSE opcodes to manipulate thunks.
The biggest problem with CompressedOOPs, for me, is the lack of linker support. You can ask for a segment to be given a chunk of memory below the 2 gig marker, but not below 4, let alone 32, so 0-based compressed OOPs can't be linked by the native linker. It wasn't until they added 0-based compressed OOPs that things finally started to perform for Java, and bringing your own linker into the mix is a decidedly unpleasant option. I'm continuing to play with them in a JIT-only environment, but they don't play nice with compilation.
-Edward Kmett
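The same check and address arithmetic in Haskell, for readers who don't speak x86, assuming invented names (isEvaluated, toAddress); bit 31 holds the evaluated flag and the remaining 31 bits index 16-byte slots, giving the 32 gig heap mentioned above:

    import Data.Bits (clearBit, testBit)
    import Data.Word (Word32, Word64)

    isEvaluated :: Word32 -> Bool
    isEvaluated ptr = testBit ptr 31

    -- base + (low 31 bits) * 16, mirroring base+[eax*8] after the doubling add.
    toAddress :: Word64 -> Word32 -> Word64
    toAddress base ptr = base + fromIntegral (clearBit ptr 31) * 16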
participants (6)
- Brandon S. Allbery KF8NH
- Carter Schonwald
- Edward Kmett
- Mads Lindstrøm
- Sebastian Sylvan
- Simon Marlow