
#9706: New block-structured heap organization for 64-bit
-------------------------------------+-------------------------------------
        Reporter:  ezyang            |                   Owner:  simonmar
            Type:  task              |                  Status:  new
        Priority:  normal            |               Milestone:
       Component:  Runtime System    |                 Version:  7.8.3
      Resolution:                    |                Keywords:
Operating System:  Unknown/Multiple  |            Architecture:  Unknown/Multiple
 Type of failure:  None/Unknown      |              Difficulty:  Unknown
       Test Case:                    |              Blocked By:
        Blocking:                    |         Related Tickets:
                                     |  Differential Revisions:
-------------------------------------+-------------------------------------

Comment (by ezyang):
> in exchange for... what?

The primary benefits of reorganizing the code in this way are:

 * We get to eliminate the HEAP_ALLOCED check (#8199), which we've seen
   experimentally to improve performance by about 8% on GC-heavy
   benchmarks, and even more for processes with large heaps.
   Additionally, we should see a (minor) further improvement, because we
   no longer have to spend a pile of time copying static data to the heap
   at startup or following indirections in code (which is the case for
   the current patchset). We also get to avoid committing a long and
   complicated patchset.

 * The HEAP_ALLOCED check gets modestly simpler; we still have to
   maintain the 32-bit megablock handling code, but we can eject most of
   the bookkeeping involved in managing the 64-bit memory mapping.

 * When we allocate data that spans multiple (today's) megablocks, we no
   longer have to waste the slop space afterwards. Today that space has
   to be left empty because its block descriptor was clobbered.

> Also, as 32-bit architectures wane, would there be a simpler but
> perhaps-less-performant fallback that would allow the code to be
> simplified for all architectures?

Unfortunately, the main concept (reserve a huge chunk of virtual address
space) doesn't work at all on 32-bit, so I don't think it's possible to
simplify the code at all here.
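For concreteness, here is a minimal sketch of the core trick being
discussed (this is not RTS code: the 1 TB reservation, the fragment
size, and all the names are invented for illustration). On a 64-bit
POSIX system we can reserve a huge contiguous range with mmap and
PROT_NONE, which consumes address space but no physical memory, commit
fragments lazily with mprotect, and HEAP_ALLOCED then degenerates into a
bounds check against the reservation. Windows would get the same shape
from VirtualAlloc with MEM_RESERVE/MEM_COMMIT, much like the Hotspot
snippet quoted further down.

{{{
/* Sketch only: illustrative constants and names, not GHC's. */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_NORESERVE            /* not defined on every platform */
#define MAP_NORESERVE 0
#endif

#define HEAP_RESERVATION (1ULL << 40)   /* reserve 1 TB of address space */
#define FRAGMENT_SIZE    (1ULL << 20)   /* commit granularity: 1 MB      */

static char *heap_base;

/* HEAP_ALLOCED becomes a simple bounds check against the reservation. */
#define HEAP_ALLOCED(p) \
    ((char *)(p) >= heap_base && (char *)(p) < heap_base + HEAP_RESERVATION)

int main(void)
{
    /* Reserve the whole range; PROT_NONE + MAP_NORESERVE means no
     * physical memory (or swap accounting) is consumed yet. */
    heap_base = mmap(NULL, HEAP_RESERVATION, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (heap_base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Commit the first fragment on demand. */
    if (mprotect(heap_base, FRAGMENT_SIZE, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect"); return 1;
    }

    int *p = (int *)heap_base;
    p[0] = 42;
    printf("base=%p HEAP_ALLOCED(p)=%d value=%d\n",
           (void *)heap_base, HEAP_ALLOCED(p), p[0]);
    return 0;
}
}}}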
> How do we decide what the max heap size should be? Total mem + swap?
I thought about this a bit; I suspect for performance reasons we will
actually want this hard-coded into the runtime, since the size of this
fragment is going to be baked into the mask we apply when doing the
Bdescr calculation. It's not bad if HEAP_ALLOCED does a memory
dereference to figure out the base address of the heap, but adding an
extra memory dereference to Bdescr might be too much.
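To make the trade-off concrete, here is a sketch of the two alternatives
(the names and constants are invented for illustration; the real macro
lives in includes/rts/storage/Block.h, and "fragment" here stands in for
whatever unit the new organization would use). With the fragment size
fixed at compile time, Bdescr is pure register arithmetic; if the size
were chosen at RTS startup instead, the mask has to be loaded from a
global, adding a dependent load to a calculation that sits on very hot
GC paths:

{{{
/* Sketch only: illustrative constants and names, not GHC's. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SHIFT     12   /* 4 KB blocks                           */
#define FRAGMENT_SHIFT  20   /* hypothetical fragment size: 1 MB      */
#define BDESCR_SHIFT    6    /* sizeof(bdescr) == 64                  */
#define BLOCK_MASK      ((uintptr_t)(1UL << BLOCK_SHIFT) - 1)
#define FRAGMENT_MASK   ((uintptr_t)(1UL << FRAGMENT_SHIFT) - 1)

typedef struct bdescr_ bdescr;   /* opaque here; we only compute addresses */

/* Fragment size baked in at compile time: keep the fragment base, take
 * the block number within the fragment, scale it by sizeof(bdescr), and
 * OR it back on.  No memory access is needed. */
static inline bdescr *Bdescr(uintptr_t p)
{
    return (bdescr *)
        (((p & FRAGMENT_MASK & ~BLOCK_MASK) >> (BLOCK_SHIFT - BDESCR_SHIFT))
         | (p & ~FRAGMENT_MASK));
}

/* Fragment size chosen at startup: the mask must be loaded from a
 * global before the same arithmetic can happen. */
static uintptr_t fragment_mask = FRAGMENT_MASK;

static inline bdescr *Bdescr_dynamic(uintptr_t p)
{
    return (bdescr *)
        (((p & fragment_mask & ~BLOCK_MASK) >> (BLOCK_SHIFT - BDESCR_SHIFT))
         | (p & ~fragment_mask));
}

int main(void)
{
    uintptr_t p = 0x42003a40;                 /* some heap address */
    printf("static  Bdescr: %p\n", (void *)Bdescr(p));
    printf("dynamic Bdescr: %p\n", (void *)Bdescr_dynamic(p));
    return 0;
}
}}}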
> I'm worried about the overhead of having to pre-allocate a large
> number of page tables for the reserved address space

also

> You mention that GCC Go does this. Is there any other prior art?
My understanding is this is pretty common for runtimes which use a
contiguous heap. I did some quick looks and the Hotspot JVM runtime also
does this:

{{{
  // first reserve enough address space in advance since we want to be
  // able to break a single contiguous virtual address range into multiple
  // large page commits but WS2003 does not allow reserving large page space
  // so we just use 4K pages for reserve, this gives us a legal contiguous
  // address space. then we will deallocate that reservation, and re alloc
  // using large pages
  const size_t size_of_reserve = bytes + _large_page_size;
  if (bytes > size_of_reserve) {
    // Overflowed.
    warning("Individually allocated large pages failed, "
            "use -XX:-UseLargePagesIndividualAllocation to turn off");
    return NULL;
  }
  p_buf = (char *) VirtualAlloc(addr,
                                size_of_reserve,  // size of Reserve
                                MEM_RESERVE,
                                PAGE_READWRITE);
}}}

It also looks like the main Go implementation does this:

{{{
    // On a 64-bit machine, allocate from a single contiguous reservation.
    // 128 GB (MaxMem) should be big enough for now.
    //
    // The code will work with the reservation at any address, but ask
    // SysReserve to use 0x0000XXc000000000 if possible (XX=00...7f).
    // Allocating a 128 GB region takes away 37 bits, and the amd64
    // doesn't let us choose the top 17 bits, so that leaves the 11 bits
    // in the middle of 0x00c0 for us to choose.  Choosing 0x00c0 means
    // that the valid memory addresses will begin 0x00c0, 0x00c1, ..., 0x00df.
    // In little-endian, that's c0 00, c1 00, ..., df 00. None of those are valid
    // UTF-8 sequences, and they are otherwise as far away from
    // ff (likely a common byte) as possible.  If that fails, we try other 0xXXc0
    // addresses.  An earlier attempt to use 0x11f8 caused out of memory errors
    // on OS X during thread allocations.  0x00c0 causes conflicts with
    // AddressSanitizer which reserves all memory up to 0x0100.
    // These choices are both for debuggability and to reduce the
    // odds of the conservative garbage collector not collecting memory
    // because some non-pointer block of memory had a bit pattern
    // that matched a memory address.
    //
    // Actually we reserve 136 GB (because the bitmap ends up being 8 GB)
    // but it hardly matters: e0 00 is not valid UTF-8 either.
    //
    // If this fails we fall back to the 32 bit memory mechanism
    arena_size = MaxMem;
    bitmap_size = arena_size / (sizeof(void*)*8/4);
    spans_size = arena_size / PageSize * sizeof(runtime·mheap.spans[0]);
    spans_size = ROUND(spans_size, PageSize);
    for(i = 0; i <= 0x7f; i++) {
        p = (void*)(i<<40 | 0x00c0ULL<<32);
        p_size = bitmap_size + spans_size + arena_size + PageSize;
        p = runtime·SysReserve(p, p_size, &reserved);
        if(p != nil)
            break;
    }
}}}

V8 does not appear to do this. I haven't checked any more yet.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9706#comment:5>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler