[GHC] #9706: New block-structured heap organization for 64-bit

#9706: New block-structured heap organization for 64-bit
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: simonmar
Type: task | Status: new
Priority: normal | Milestone:
Component: Runtime System | Version: 7.8.3
Keywords: | Operating System:
Architecture: Unknown/Multiple | Unknown/Multiple
Difficulty: Unknown | Type of failure:
Blocked By: | None/Unknown
Related Tickets: | Test Case:
| Blocking:
| Differential Revisions:
-------------------------------------+-------------------------------------
I was having some discussion about GHC's block-structured heap with Sergio Benitez and Adam Belay, and during the discussion it was suggested that the way GHC manages the block-structured heap is suboptimal when we're on 64-bit architectures.

At the moment, we allocate memory from the operating system per-megablock, storing metadata in the very first megablock. We have to do this because, on 32-bit, we can't generally be too picky about what address our memory ends up living at. On 64-bit, we have a lot more flexibility.

Here is the proposal:

 1. Statically decide on a maximum heap size in a power of two.
 2. Next, probe for some appropriately aligned chunk of available virtual address space for this. On POSIX, we can mmap /dev/null using PROT_NONE and MAP_NORESERVE. On Windows, we can use VirtualAlloc with MEM_RESERVE. (There are a few other runtimes which do this trick, including GCC Go.)
 3. Divide this region into blocks as before. The maximum heap size is now the megablock size, and the block size is still the same as before. Masking to find the block descriptor works as before.
 4. To allocate, we keep track of the high-watermark, and mmap in 1MB pages as they are requested. We also keep track of how much metadata we need, and mmap extra pages to store metadata as necessary.

We still want to request memory from the operating system in conveniently sized chunks, but we can now abolish the notion of a megablock and the megablock allocator, and work purely with block coalescing. Additionally, the recorded heap location means that we can check if a pointer is HEAP_ALLOCED using a mask and equality check, solving #8199.

What do people think?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
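The four steps of the proposal can be sketched roughly as follows for a POSIX system. This is only an illustration under stated assumptions, not the patch that was eventually written: the names `heap_region`, `heap_reserve`, `heap_grow` and `CHUNK_SIZE` are hypothetical, the sketch uses an anonymous mapping (`MAP_ANONYMOUS`) instead of a device file, it gets alignment by over-reserving and trimming, and it reads "mmap in 1MB pages" as an `mprotect` call on the next chunk above the high-watermark. Metadata pages are left out.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define CHUNK_SIZE ((size_t)1 << 20)   /* hypothetical: commit 1 MB at a time */

typedef struct {
    char  *base;       /* start of the reserved, size-aligned region */
    size_t size;       /* bytes of address space reserved (power of two) */
    size_t committed;  /* high-watermark: bytes made usable so far */
} heap_region;

/* Steps 1-2: reserve `size` bytes of address space, aligned to `size`.
 * Nothing is readable or writable yet. Returns 0 on success, -1 on failure. */
int heap_reserve(heap_region *h, size_t size)
{
    assert(size >= CHUNK_SIZE && (size & (size - 1)) == 0);

    /* Over-reserve so that an aligned sub-range is guaranteed to exist. */
    size_t span = size * 2;
    char *raw = mmap(NULL, span, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (raw == (char *)MAP_FAILED)
        return -1;

    uintptr_t aligned = ((uintptr_t)raw + size - 1) & ~((uintptr_t)size - 1);
    char *base = (char *)aligned;

    /* Give back the unaligned head and the unused tail. */
    if (base > raw)
        munmap(raw, (size_t)(base - raw));
    if (base + size < raw + span)
        munmap(base + size, (size_t)((raw + span) - (base + size)));

    h->base = base;
    h->size = size;
    h->committed = 0;
    return 0;
}

/* Step 4: make the next chunk above the high-watermark usable.
 * Returns the chunk, or NULL if the reservation is exhausted or the
 * kernel refuses the request. */
void *heap_grow(heap_region *h)
{
    if (h->size - h->committed < CHUNK_SIZE)
        return NULL;
    char *chunk = h->base + h->committed;
    if (mprotect(chunk, CHUNK_SIZE, PROT_READ | PROT_WRITE) != 0)
        return NULL;
    h->committed += CHUNK_SIZE;
    return chunk;
}
```

Because the base is aligned to the (power-of-two) region size, the mask-and-compare membership test mentioned in the ticket becomes possible.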

Comment (by simonpj):

Sounds plausible to me, but what is the benefit? We seem to get

 * More complicated code (since we still need the megablocks for 32-bit)

in exchange for... what? I'm sure there is something, but it would be worth making the cost/benefit tradeoff explicit.

Also, as 32-bit architectures wane, would there be a simpler but perhaps-less-performant fallback that would allow the code to be simplified for all architectures?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:1

Comment (by simonmar):

There are two parts to this:

 1. put all the heap memory together in the address space
 2. change where we put the block descriptors

Let's just worry about (1) for now.

 * Windows doesn't overcommit, so we can't do this on Windows
 * How do we decide what the max heap size should be? Total mem + swap?
 * I'm worried about the overhead of having to pre-allocate a large number of page tables for the reserved address space
 * Every Haskell process would show a ridiculous VSIZE

Still, from a purely selfish perspective, for those on 64-bit Linux this might make sense and I'd be happy to take that 8% GC win for very little effort, even if it is non-portable.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:2

Comment (by refold):

Replying to [comment:2 simonmar]:
> * Windows doesn't overcommit, so we can't do this on Windows

From [http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887%28v=vs.85%29.aspx the documentation for VirtualAlloc] it seems like it's possible to first reserve a memory range with `MEM_RESERVE` and then commit it as needed with `MEM_COMMIT`. [http://blogs.technet.com/b/markrussinovich/archive/2008/11/17/3155406.aspx This article] confirms it.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:3

Comment (by tibbe):

You mention that GCC Go does this. Is there any other prior art?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:4

Comment (by ezyang):

> in exchange for... what?

The primary benefits of reorganizing the code in this way are:

 * We get to eliminate the HEAP_ALLOCED check (#8199), which we've seen experimentally to improve performance by about 8% on GC heavy benchmarks, and even more for processes with large heaps. Additionally, we should see a (minor) further improvement, because we no longer have to spend a pile of time at startup copying static data to the heap or follow indirections in code (which is the case for the current patchset). We get to avoid committing a long and complicated patchset.
 * The HEAP_ALLOCED check gets modestly simpler; we still have to maintain 32-bit megablock handling code, but we can eject most of the bookkeeping involved with managing 64-bit memory mapping.
 * When we allocate data that spans multiple (today's) megablocks, we no longer have to waste the slop space afterwards. This has to be left empty today because the block descriptor was clobbered.

> Also, as 32-bit architectures wane, would there be a simpler but perhaps-less-performant fallback that would allow the code to be simplified for all architectures?

Unfortunately, the main concept (reserve a huge chunk of virtual address space) doesn't work at all on 32-bit, so I don't think it's possible to simplify the code at all here.
> How do we decide what the max heap size should be? Total mem + swap?
I thought about this a bit; I suspect for performance reasons we will actually want this hard-coded into the runtime, since the size of this fragment is going to be built into the mask we do when doing the Bdescr calculation. It's not bad if HEAP_ALLOCED does a memory dereference to figure out the base address of the heap, but adding an extra memory dereference to Bdescr might be too much.
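The trade-off described here can be made concrete with a small sketch. Everything below is an assumption made for illustration: a hypothetical 1 TB heap (`HEAP_SHIFT` of 40), 4 KB blocks, 64-byte block descriptors stored at the start of the region, and the names `heap_base`, `heap_alloced` and `bdescr_addr`. With the size compiled in, the membership test needs one load (of the base) plus a mask and compare, while the descriptor lookup uses only constants and the pointer itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HEAP_SHIFT   40                          /* hypothetical: 1 TB heap   */
#define HEAP_SIZE    ((uintptr_t)1 << HEAP_SHIFT)
#define HEAP_MASK    (HEAP_SIZE - 1)
#define BLOCK_SHIFT  12                          /* assumed 4 KB blocks       */
#define BDESCR_SHIFT 6                           /* assumed 64-byte descriptor */

/* Set once at startup; must be aligned to HEAP_SIZE. */
uintptr_t heap_base;

/* Membership test: load heap_base, mask, compare. */
bool heap_alloced(const void *p)
{
    return ((uintptr_t)p & ~HEAP_MASK) == heap_base;
}

/* Address of the descriptor for the block containing p.
 * No memory is read: the region base is recovered from p itself,
 * and descriptor n sits n * 64 bytes into the region. */
uintptr_t bdescr_addr(uintptr_t p)
{
    assert(heap_alloced((const void *)p));
    uintptr_t block_index = (p & HEAP_MASK) >> BLOCK_SHIFT;
    return (p & ~HEAP_MASK) | (block_index << BDESCR_SHIFT);
}
```

If the heap size were instead chosen at run time, `HEAP_MASK` would become a variable and `bdescr_addr` would need the extra memory read that the comment above is worried about.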
> I'm worried about the overhead of having to pre-allocate a large number of page tables for the reserved address space

also

> You mention that GCC Go does this. Is there any other prior art?
My understanding is this is pretty common for runtimes which use a contiguous heap. I took a quick look, and the Hotspot JVM runtime also does this:

{{{
// first reserve enough address space in advance since we want to be
// able to break a single contiguous virtual address range into multiple
// large page commits but WS2003 does not allow reserving large page space
// so we just use 4K pages for reserve, this gives us a legal contiguous
// address space. then we will deallocate that reservation, and re alloc
// using large pages
const size_t size_of_reserve = bytes + _large_page_size;
if (bytes > size_of_reserve) {
  // Overflowed.
  warning("Individually allocated large pages failed, "
          "use -XX:-UseLargePagesIndividualAllocation to turn off");
  return NULL;
}
p_buf = (char *) VirtualAlloc(addr,
                              size_of_reserve,  // size of Reserve
                              MEM_RESERVE,
                              PAGE_READWRITE);
}}}

It also looks like the main Go implementation does this:

{{{
// On a 64-bit machine, allocate from a single contiguous reservation.
// 128 GB (MaxMem) should be big enough for now.
//
// The code will work with the reservation at any address, but ask
// SysReserve to use 0x0000XXc000000000 if possible (XX=00...7f).
// Allocating a 128 GB region takes away 37 bits, and the amd64
// doesn't let us choose the top 17 bits, so that leaves the 11 bits
// in the middle of 0x00c0 for us to choose. Choosing 0x00c0 means
// that the valid memory addresses will begin 0x00c0, 0x00c1, ..., 0x00df.
// In little-endian, that's c0 00, c1 00, ..., df 00. None of those are valid
// UTF-8 sequences, and they are otherwise as far away from
// ff (likely a common byte) as possible. If that fails, we try other 0xXXc0
// addresses. An earlier attempt to use 0x11f8 caused out of memory errors
// on OS X during thread allocations. 0x00c0 causes conflicts with
// AddressSanitizer which reserves all memory up to 0x0100.
// These choices are both for debuggability and to reduce the
// odds of the conservative garbage collector not collecting memory
// because some non-pointer block of memory had a bit pattern
// that matched a memory address.
//
// Actually we reserve 136 GB (because the bitmap ends up being 8 GB)
// but it hardly matters: e0 00 is not valid UTF-8 either.
//
// If this fails we fall back to the 32 bit memory mechanism
arena_size = MaxMem;
bitmap_size = arena_size / (sizeof(void*)*8/4);
spans_size = arena_size / PageSize * sizeof(runtime·mheap.spans[0]);
spans_size = ROUND(spans_size, PageSize);
for(i = 0; i <= 0x7f; i++) {
	p = (void*)(i<<40 | 0x00c0ULL<<32);
	p_size = bitmap_size + spans_size + arena_size + PageSize;
	p = runtime·SysReserve(p, p_size, &reserved);
	if(p != nil)
		break;
}
}}}

V8 does not appear to do this. I haven't checked any more yet.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:5
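The arithmetic in the quoted Go comment can be checked with a short worked example. This is a standalone sketch, not Go runtime code: `MAX_MEM`, `bitmap_bytes` and `reserve_hint` are hypothetical names, and 8-byte pointers are assumed so that `sizeof(void*)*8/4` is 16.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MEM       ((uint64_t)1 << 37)   /* 128 GB arena, i.e. 37 bits  */
#define POINTER_BYTES 8                     /* assumption: 64-bit pointers */

/* Mirrors `arena_size / (sizeof(void*)*8/4)` from the quoted code. */
uint64_t bitmap_bytes(uint64_t arena_size)
{
    return arena_size / (POINTER_BYTES * 8 / 4);
}

/* Mirrors `i<<40 | 0x00c0ULL<<32`: the i-th address tried for the
 * reservation, of the form 0x0000XXc000000000 with XX = i. */
uint64_t reserve_hint(unsigned i)
{
    assert(i <= 0x7f);
    return ((uint64_t)i << 40) | ((uint64_t)0x00c0 << 32);
}
```

With these definitions the bitmap for a 128 GB arena comes to 8 GB, which matches the "136 GB" figure in the comment (before the comparatively small spans table and the extra page are added).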

Comment (by ezyang):

Sorry, it looks like I misread the Hotspot code; it's using the reservations to do something else.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:6

Comment (by carter):

Would this simplify the engineering needed to, e.g., support "capability/thread local heaps"?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:7

Comment (by ezyang):

No, no difference there.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:8

Comment (by rwbarton):

> Next, probe for some appropriately aligned chunk of available virtual address space for this. On POSIX, we can mmap /dev/null using PROT_NONE and MAP_NORESERVE.

Is this different from using MAP_ANONYMOUS?

BTW, I found that I could mmap 100 TB with PROT_NONE (or even PROT_READ) and MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED with no measurable delay, and I can start 10000 such processes at once, so there doesn't seem to be any significant cost to setting up the page table mappings (at least on my system). Not sure why that is, exactly. The VSZ column in ps looks quite funny of course :)

Also, this was on a system with overcommit disabled (vm.overcommit_memory = 2). So PROT_NONE or PROT_READ pages don't count against the commit limit.

> To allocate, we keep track of the high-watermark, and mmap in 1MB pages as they are requested. We also keep track of how much metadata we need, and mmap extra pages to store metadata as necessary.

By "mmap" here do you mean using mprotect() to make parts of our reserved area writable, or something else?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:9

Comment (by rwbarton):

Replying to [comment:9 rwbarton]:
> Not sure why that is, exactly.

Oh, I guess there ''are'' no page table mappings until we access the memory, causing a segmentation fault which the kernel handles by consulting the process's memory map table and allocating a new page with a corresponding page table mapping.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:10

Comment (by thoughtpolice):

> I'm worried about the overhead of having to pre-allocate a large number of page tables for the reserved address space

Like Reid mentioned, this shouldn't be too much of an issue on Linux, at least, until we get the initial page fault to map things in the first place.

But beyond that, we can possibly mitigate some TLB/mapping thrashes a bit on Linux at least using hugetables, which should beef up the page sizes from 4k to 1MB or so. On older Linux systems though this is a bit of a basket case, since you need `hugetlbfs`, which is super annoying and predates the introduction of transparent hugepages/mmap flags in glibc.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:11

Comment (by dfeuer):

Is it safe to allocate, say, half the address space for the heap? That would offer enough room to grow for a while. Anything that limits the heap to a few terabytes could become a serious limitation within a fairly short time.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:12

Comment (by simonmar):

Ok, so there's no problem with page tables for the reserved region on Linux.

Using total memory + swap is a safe limit for the address space to reserve, since we never want to overcommit with actual heap memory.

I still think we should separate this from the question of reorganising the block descriptors, which is probably a good idea and depends on this, but can be tackled separately.

Open questions:

 * Does VirtualAlloc on Windows behave the same way? That is, can we allocate as much address space as we want, without creating page tables and without getting an out of memory error?
 * What do we do when overcommit is disabled on Linux?
 * What about OS X?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:13

Comment (by rwbarton):

I did my experiments on a Linux system with overcommit disabled and watched the `Committed_AS` number from `/proc/meminfo`. According to my tests, anonymous pages which have never been mapped with `PROT_WRITE` don't count against the commit limit. Once a page is given `PROT_WRITE` permissions then it counts as committed even if `PROT_WRITE` is removed before the page is ever touched. I would guess we can "un-commit" pages by doing a new mmap over them with `PROT_NONE`, but I didn't test this.

Carter tested that on OS X, mmapping 100 TB doesn't take any noticeable time, at least.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:14
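The commit and "un-commit" steps discussed in this comment could look something like the sketch below. It is a guess at the mechanics only: the function names are hypothetical, the un-commit path (a fresh `PROT_NONE` mapping placed over the range with `MAP_FIXED`) is the idea the comment describes as untested, and the sketch says nothing about how `Committed_AS` actually responds on a given kernel.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve address space without making it accessible. */
void *region_reserve(size_t len)
{
    void *p = mmap(NULL, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Make part of the reservation readable and writable. */
int region_commit(void *p, size_t len)
{
    return mprotect(p, len, PROT_READ | PROT_WRITE);
}

/* Replace the range with a fresh PROT_NONE mapping. The old pages and
 * their contents are discarded; the address range stays reserved. */
int region_decommit(void *p, size_t len)
{
    assert(p != NULL && len > 0);
    void *q = mmap(p, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return q == p ? 0 : -1;
}
```

One visible consequence of replacing the mapping is that a range which is committed again afterwards reads back as zeros rather than as its previous contents.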

Comment (by refold):

Replying to [comment:13 simonmar]:
> Open questions:
>
> * Does VirtualAlloc on Windows behave the same way? That is, can we allocate as much address space as we want, without creating page tables and without getting an out of memory error?

According to chapter 7 of Russinovich's book (4th ed., which covers XP, 2000 and Server 2003), page tables are allocated lazily. I did some tests on Windows 7 ([https://gist.github.com/23Skidoo/ae5cedb5e4717e96de3d here's my test program]): reserving a terabyte of memory doesn't take any noticeable time; however, the system refuses to reserve more than 1423 GB.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:15

Changes (by gcampax):

 * cc: gcampax (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:16

Changes (by ezyang):

 * owner: simonmar => gcampax
 * differential: => D524

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:17

Comment (by gcampax):

nofib results from patch in D524:

{{{
--------------------------------------------------------
Program             Size  Allocs Runtime ElapsedTotalMem
--------------------------------------------------------
anna               -0.0%    0.0%   0.107   0.107    0.0%
ansi               -0.1%    0.0%   0.000   0.000    0.0%
atom               -0.1%    0.0%   -3.6%   -3.6%    0.0%
awards             -0.0%    0.0%   0.000   0.000    0.0%
banner             -0.1%    0.0%   0.000   0.000    0.0%
bernouilli         -0.1%    0.0%   -1.2%   -1.2%    0.0%
binary-trees       -0.1%    0.0%   -4.6%   -4.6%    0.0%
boyer              -0.1%    0.0%   0.046   0.046    0.0%
boyer2             -0.1%    0.0%   0.009   0.009    0.0%
bspt               -0.0%    0.0%   0.010   0.010    0.0%
cacheprof          -0.0%   -0.3%   +0.2%   +0.2%   +3.5%
calendar           -0.0%    0.0%   0.000   0.000    0.0%
cichelli           -0.1%    0.0%   0.088   0.088   +3.2%
circsim            -0.1%    0.0%   -1.0%   -1.0%    0.0%
clausify           -0.1%    0.0%   0.041   0.040    0.0%
comp_lab_zift      -0.1%    0.0%   -2.0%   -1.9%    0.0%
compress           -0.1%    0.0%   0.189   0.189    0.0%
compress2          -0.1%    0.0%   0.173   0.173    0.0%
constraints        -0.1%    0.0%   -6.0%   -6.0%    0.0%
cryptarithm1       -0.1%    0.0%   -0.4%   -0.3%    0.0%
cryptarithm2       -0.1%    0.0%   0.011   0.011    0.0%
cse                -0.1%    0.0%   0.002   0.002    0.0%
eliza              -0.1%    0.0%   0.001   0.001    0.0%
event              -0.1%    0.0%   0.153   0.153    0.0%
exp3_8             -0.1%    0.0%   -1.4%   -1.4%    0.0%
expert             -0.0%    0.0%   0.000   0.000    0.0%
fannkuch-redux     -0.1%    0.0%   +0.6%   +0.6%    0.0%
fasta              -0.0%    0.0%   +0.1%   +0.2%    0.0%
fem                -0.1%    0.0%   0.024   0.024    0.0%
fft                -0.0%    0.0%   0.034   0.034    0.0%
fft2               -0.0%    0.0%   0.049   0.049    0.0%
fibheaps           -0.1%    0.0%   0.030   0.030    0.0%
fish               -0.1%    0.0%   0.013   0.013    0.0%
fluid              -0.0%    0.0%   0.010   0.010    0.0%
fulsom             -0.1%    0.0%   -5.5%   -5.5%    0.0%
gamteb             -0.1%    0.0%   0.042   0.042    0.0%
gcd                -0.1%    0.0%   0.048   0.048    0.0%
gen_regexps        -0.1%    0.0%   0.000   0.000    0.0%
genfft             -0.1%    0.0%   0.035   0.035    0.0%
gg                 -0.0%    0.0%   0.011   0.011    0.0%
grep               -0.0%    0.0%   0.000   0.000    0.0%
hidden             -0.1%    0.0%   -9.1%   -9.0%    0.0%
hpg                -0.1%    0.0%   0.114   0.114    0.0%
ida                -0.1%    0.0%   0.090   0.090    0.0%
infer              -0.1%    0.0%   0.052   0.052    0.0%
integer            -0.1%    0.0%   +0.9%   +1.0%    0.0%
integrate          -0.1%    0.0%   0.121   0.121    0.0%
k-nucleotide       -0.0%    0.0%  -10.8%  -10.8%    0.0%
kahan              -0.0%    0.0%   +0.1%   +0.1%    0.0%
knights            -0.1%    0.0%   0.005   0.005    0.0%
lcss               -0.1%    0.0%   -6.8%   -6.8%    0.0%
life               -0.1%    0.0%   -4.1%   -4.0%    0.0%
lift               -0.0%    0.0%   0.002   0.002    0.0%
listcompr          -0.1%    0.0%   0.101   0.101    0.0%
listcopy           -0.1%    0.0%   0.107   0.107    0.0%
maillist           -0.1%    0.0%   0.060   0.060   +2.2%
mandel             -0.0%    0.0%   0.077   0.077    0.0%
mandel2            -0.1%    0.0%   0.004   0.004    0.0%
minimax            -0.0%    0.0%   0.003   0.003    0.0%
mkhprog            -0.1%    0.0%   0.003   0.003    0.0%
multiplier         -0.1%    0.0%   0.147   0.147    0.0%
n-body             -0.1%    0.0%   +1.5%   +1.5%    0.0%
nucleic2           -0.1%    0.0%   0.068   0.068    0.0%
para               -0.1%    0.0%   +1.6%   +1.5%    0.0%
paraffins          -0.1%    0.0%   0.112   0.112    0.0%
parser             -0.1%    0.0%   0.030   0.030    0.0%
parstof            -0.1%    0.0%   0.006   0.006    0.0%
pic                -0.1%    0.0%   0.007   0.007    0.0%
pidigits           -0.1%    0.0%   -0.1%   -0.0%    0.0%
power              -0.1%    0.0%   -7.8%   -7.8%    0.0%
pretty             -0.1%    0.0%   0.000   0.000    0.0%
primes             -0.1%    0.0%   0.070   0.070    0.0%
primetest          -0.1%    0.0%   0.130   0.130    0.0%
prolog             -0.1%    0.0%   0.002   0.002    0.0%
puzzle             -0.1%    0.0%   0.148   0.148    0.0%
queens             -0.1%    0.0%   0.015   0.015    0.0%
reptile            -0.1%    0.0%   0.011   0.011    0.0%
reverse-complem    -0.1%    0.0%   0.117   0.117    0.0%
rewrite            -0.1%    0.0%   0.018   0.018    0.0%
rfib               -0.1%    0.0%   0.020   0.020    0.0%
rsa                -0.1%    0.0%   0.028   0.028    0.0%
scc                -0.1%    0.0%   0.000   0.000    0.0%
sched              -0.1%    0.0%   0.023   0.023    0.0%
scs                -0.0%    0.0%   -2.4%   -2.5%    0.0%
simple             -0.0%    0.0%   -2.7%   -2.7%    0.0%
solid              -0.1%    0.0%   0.154   0.154    0.0%
sorting            -0.1%    0.0%   0.002   0.002    0.0%
spectral-norm      -0.1%    0.0%   -0.0%   -0.0%    0.0%
sphere             -0.1%    0.0%   0.053   0.053    0.0%
symalg             -0.1%    0.0%   0.015   0.015    0.0%
tak                -0.1%    0.0%   0.016   0.016    0.0%
transform          -0.1%    0.0%   -1.2%   -1.1%    0.0%
treejoin           -0.1%    0.0%   0.144   0.144    0.0%
typecheck          -0.1%    0.0%  -13.9%  -13.8%    0.0%
veritas            -0.0%    0.0%   0.002   0.002    0.0%
wang               -0.1%    0.0%   0.119   0.119    0.0%
wave4main          -0.1%    0.0%   -3.4%   -3.4%    0.0%
wheel-sieve1       -0.1%    0.0%   -0.3%   -0.4%    0.0%
wheel-sieve2       -0.1%    0.0%   -2.8%   -2.8%    0.0%
x2n1               -0.1%    0.0%   0.005   0.005    0.0%
--------------------------------------------------------
Min                -0.1%   -0.3%  -13.9%  -13.8%    0.0%
Max                -0.0%    0.0%   +1.6%   +1.5%   +3.5%
Geometric Mean     -0.1%   -0.0%   -2.9%   -2.9%   +0.1%
}}}

I think it's also significant to look at GC elapsed time:

{{{
----------------------------------------------
Program           nofib-log-old  nofib-log-new
----------------------------------------------
anna                      0.025          0.024
ansi                      0.000          0.000
atom                      0.283          -6.4%
awards                    0.000          0.000
banner                    0.000          0.000
bernouilli                0.075          0.072
binary-trees              0.323          -6.6%
boyer                     0.023          0.021
boyer2                    0.004          0.003
bspt                      0.005          0.005
cacheprof                 0.145          0.143
calendar                  0.000          0.000
cichelli                  0.017          0.016
circsim                   0.577          -6.3%
clausify                  0.015          0.013
comp_lab_zift             0.098          0.094
compress                  0.082          0.079
compress2                 0.145          0.140
constraints               2.335          -8.8%
cryptarithm1              0.022          0.021
cryptarithm2              0.001          0.001
cse                       0.001          0.001
eliza                     0.000          0.000
event                     0.088          0.082
exp3_8                    0.039          0.037
expert                    0.000          0.000
fannkuch-redux            0.007          0.007
fasta                     0.004          0.004
fem                       0.006          0.006
fft                       0.023          0.022
fft2                      0.019          0.018
fibheaps                  0.013          0.012
fish                      0.001          0.001
fluid                     0.002          0.002
fulsom                    0.205          -7.4%
gamteb                    0.009          0.009
gcd                       0.001          0.001
gen_regexps               0.000          0.000
genfft                    0.015          0.014
gg                        0.005          0.005
grep                      0.000          0.000
hidden                    0.022          0.021
hpg                       0.018          0.017
ida                       0.028          0.027
infer                     0.026          0.024
integer                   0.025          0.024
integrate                 0.002          0.002
k-nucleotide              0.025          0.024
kahan                     0.001          0.000
knights                   0.001          0.001
lcss                      0.359          -8.8%
life                      0.155          0.143
lift                      0.000          0.001
listcompr                 0.005          0.005
listcopy                  0.006          0.005
maillist                  0.013          0.012
mandel                    0.004          0.004
mandel2                   0.001          0.001
minimax                   0.001          0.001
mkhprog                   0.001          0.001
multiplier                0.042          0.039
n-body                    0.009          0.009
nucleic2                  0.005          0.004
para                      0.095          0.088
paraffins                 0.099          0.091
parser                    0.012          0.011
parstof                   0.002          0.002
pic                       0.004          0.004
pidigits                  0.044          0.043
power                     0.282         -11.9%
pretty                    0.000          0.000
primes                    0.023          0.021
primetest                 0.002          0.002
prolog                    0.001          0.001
puzzle                    0.019          0.018
queens                    0.000          0.000
reptile                   0.005          0.004
reverse-complem           0.001          0.001
rewrite                   0.001          0.001
rfib                      0.000          0.000
rsa                       0.001          0.001
scc                       0.000          0.000
sched                     0.002          0.002
scs                       0.243          -6.2%
simple                    0.081          0.074
solid                     0.087          0.080
sorting                   0.001          0.001
spectral-norm             0.000          0.000
sphere                    0.005          0.005
symalg                    0.001          0.001
tak                       0.000          0.000
transform                 0.095          0.089
treejoin                  0.092          0.087
typecheck                 0.013          0.012
veritas                   0.001          0.001
wang                      0.090          0.087
wave4main                 0.118          0.110
wheel-sieve1              0.031          0.029
wheel-sieve2              0.165          0.159
x2n1                      0.000          0.000
-1 s.d.                   -----          -9.7%
+1 s.d.                   -----          -5.9%
Average                   -----          -7.8%
}}}

Results are comparing 4897e7 (up to date master) with a2de1f (the patch), on a 16 core Intel Xeon 2.40GHz, 48GB RAM total, no swap, overcommit_memory = 0 (the default), on an x86_64 Linux 3.12.9.

It would be interesting to see why certain programs are regressing. They seem to be multithreaded programs, so it could be that we're spending more time in the memory allocator, which is locked and kills multithreading. Or it could be that they trigger pessimal allocator behavior.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:18

#9706: New block-structured heap organization for 64-bit
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: gcampax
Type: task | Status: new
Priority: normal | Milestone:
Component: Runtime System | Version: 7.8.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: None/Unknown | Difficulty: Unknown
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Revisions: D524 |
-------------------------------------+-------------------------------------
Comment (by simonmar):

Anything less than 2% is probably noise, unless you can reproduce it consistently and/or there is some objective measurement such as number of instructions or memory reads using perf. There are other random effects that can cause small wobbles, even code placement by the linker or alignment effects.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:19
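The 2% rule of thumb from comment 19 can be applied mechanically when reading a results table. The sketch below, in Python, uses the percentage entries from the third numeric column of the first table in comment 18, which is assumed here to be the Runtime column. `NOISE_THRESHOLD` and `split_by_noise` are hypothetical names, not part of nofib.

```python
# Threshold taken from the comment: changes under 2% are probably noise.
NOISE_THRESHOLD = 2.0

# Percentage changes copied from the tail of the first table in comment 18
# (assumed to be the Runtime column). Rows with raw times are omitted.
runtime_change = {
    "transform": -1.2,
    "typecheck": -13.9,
    "wave4main": -3.4,
    "wheel-sieve1": -0.3,
    "wheel-sieve2": -2.8,
}


def split_by_noise(changes, threshold=NOISE_THRESHOLD):
    """Split programs into (probably noise, worth a closer look)."""
    noise = sorted(name for name, pct in changes.items() if abs(pct) < threshold)
    notable = sorted(name for name, pct in changes.items() if abs(pct) >= threshold)
    return noise, notable


noise, notable = split_by_noise(runtime_change)
print("probably noise:", noise)
print("worth a closer look:", notable)
```

This only flags candidates. As the comment says, a change below the threshold can still be real if it reproduces consistently or shows up in an objective counter such as instructions retired.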

#9706: New block-structured heap organization for 64-bit
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: gcampax
Type: task | Status: new
Priority: normal | Milestone:
Component: Runtime System | Version: 7.8.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions: D524
-------------------------------------+-------------------------------------
Comment (by Simon Marlow

#9706: New block-structured heap organization for 64-bit
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: gcampax
Type: task | Status: closed
Priority: normal | Milestone:
Component: Runtime System | Version: 7.8.3
Resolution: fixed | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions: D524
-------------------------------------+-------------------------------------
Changes (by simonmar):

 * status: new => closed
 * resolution: => fixed

Comment:

Thanks to everyone who worked on this, particularly @ezyang for exploring the earlier solution and @gcampax for writing most of the patch I just committed.

@ezyang, I suggest making a separate ticket for experiments with reorganising the block descriptors if you want to follow up on that.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:21

#9706: New block-structured heap organization for 64-bit
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: gcampax
Type: task | Status: closed
Priority: normal | Milestone:
Component: Runtime System | Version: 7.8.3
Resolution: fixed | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions: Phab:D524
-------------------------------------+-------------------------------------
Changes (by rwbarton):

 * differential: D524 => Phab:D524

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9706#comment:22

#9706: New block-structured heap organization for 64-bit
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: gcampax
Type: task | Status: closed
Priority: normal | Milestone:
Component: Runtime System | Version: 7.8.3
Resolution: fixed | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D524
-------------------------------------+-------------------------------------
Comment (by Herbert Valerio Riedel