[GHC] #8732: Global big object heap allocator lock causes contention

#8732: Global big object heap allocator lock causes contention
------------------------------------+-------------------------------------
       Reporter:  tibbe             |             Owner:  simonmar
           Type:  bug               |            Status:  new
       Priority:  normal            |         Milestone:
      Component:  Runtime System    |           Version:  7.6.3
       Keywords:                    |  Operating System:  Unknown/Multiple
   Architecture:  Unknown/Multiple  |   Type of failure:  None/Unknown
     Difficulty:  Unknown           |         Test Case:
     Blocked By:                    |          Blocking:
Related Tickets:                    |
------------------------------------+-------------------------------------
 The lock allocate() takes when allocating big objects hurts the scalability of I/O-bound applications. Network.Socket.ByteString.recv is typically called with a buffer size of 4096, which causes a ByteString of that size to be allocated. The size of this ByteString causes it to be allocated from the big object space, which causes contention on the global lock that guards that space.

 See http://www.yesodweb.com/blog/2014/02/new-warp for a real-world example.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

Description changed by tibbe:

New description:

 The lock `allocate` takes when allocating big objects hurts the scalability of I/O-bound applications. `Network.Socket.ByteString.recv` is typically called with a buffer size of 4096, which causes a `ByteString` of that size to be allocated. The size of this `ByteString` causes it to be allocated from the big object space, which causes contention on the global lock that guards that space.

 See http://www.yesodweb.com/blog/2014/02/new-warp for a real-world example.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:1
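To make the size argument in the description concrete, here is a minimal sketch of the threshold check. The constants are assumptions modelled on the RTS defaults (4096-byte blocks, with the large-object cutoff at 80% of a block) and should be checked against the GHC version in use. `is_large_object` is a hypothetical helper, not an RTS function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed values, modelled on GHC's RTS constants; verify for your version. */
#define BLOCK_SIZE             4096u                   /* bytes per block     */
#define LARGE_OBJECT_THRESHOLD (BLOCK_SIZE * 8u / 10u) /* 3276 bytes (assumed) */

/* Hypothetical helper: true when a request of `bytes` would be served
 * from the big object space rather than from the nursery. */
static bool is_large_object(size_t bytes)
{
    return bytes >= LARGE_OBJECT_THRESHOLD;
}
```

Under these assumptions a 4096-byte recv buffer is above the cutoff, so every such allocation takes the big object path, while a 1024-byte buffer would not.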

Changes (by hvr):

 * cc: hvr (added)
 * failure:  None/Unknown => Runtime performance bug
 * milestone:  => 7.10.1

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:2

Comment (by ezyang):

 It is a good thing that these blocks are considered big blocks, since we don't really want to be copying the buffers around. So one thought might be to make the large block list in generation-0 per-thread, and perform allocations from a thread-local block list.

 But you have to be careful: objects that are larger than a block need contiguous blocks, so unless you are only going to enable this for large objects that still fit in a single block, you'll have to maintain multiple lists with the sizes you want.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:3
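The "multiple lists with the sizes you want" idea can be sketched as a size-class computation: round a request up to whole blocks, then pick a free list by the power of two of the block count. This is only an illustration. The 4096-byte block size is an assumption, and `blocks_needed` and `size_class` are hypothetical names, not RTS functions.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 4096u   /* assumed block size in bytes */

/* Number of contiguous blocks needed to hold `bytes`, rounded up. */
static size_t blocks_needed(size_t bytes)
{
    assert(bytes > 0);
    return (bytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Hypothetical size class: floor(log2(nblocks)), so that free list k
 * holds groups of 2^k .. 2^(k+1)-1 contiguous blocks. */
static unsigned size_class(size_t nblocks)
{
    assert(nblocks > 0);
    unsigned k = 0;
    while (nblocks >>= 1) {
        k++;
    }
    return k;
}
```

A per-thread allocator built this way would need one such array of lists per thread, which is the extra bookkeeping the comment warns about.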

Comment (by tibbe):

 > But you have to be careful: objects that are larger than a block need contiguous blocks, so unless you are only going to enable this for large objects that still fit in a single block, you'll have to maintain multiple lists with the sizes you want.

 I think malloc already does that, so we could copy whatever they do perhaps.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:4

Comment (by ezyang):

 It's pretty standard, yes (we implement it for handling the global block pool), but it does mean all of that code would have to be made thread-local.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:5

Comment (by tibbe):

 Replying to [comment:5 ezyang]:
 > It's pretty standard, yes (we implement it for handling the global block pool), but it does mean all of that code would have to be made thread-local.

 I guess that means even worse performance problems on OS X? Even if it does, it sounds like the right thing to do.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:6

Comment (by carter):

 @tibbe, because TLS is slow on OS X currently? (Mind you, my understanding is that the other RTS issues go away when building GHC with a REAL GCC, right? I take it that's not the case for this discussion?)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:7

Comment (by ezyang):

 In this case, slowness of TLS is not an issue, because we manually pass around pointers to structs which are known to be per-capability (and can be accessed in an unsynchronized way), so you don't actually need thread-local *state*.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:8

Comment (by simonmar):

 I don't really understand why in mighty he couldn't just re-use the same block.

 I'm kind of surprised that this is a bottleneck, and I think it needs more investigation. We only take the lock for large objects, so typically there's going to be a lot of computation going on per allocation.

 I suppose if it really is a problem then we could just have a per-thread block pool at the granularity of a megablock to avoid fragmentation issues. We just push the global lock back to the megablock free list. This has the danger that we might have a lot of free blocks owned by one thread that don't get used, though, so we might want to redistribute the free blocks at GC. Things start to get annoyingly complicated.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:9

Comment (by simonmar):

 It's even harder than that, because a block can be allocated by one thread and freed by another thread, so we lose block coalescing, even if it can be made to work safely.

 So I suggest if we want to do anything at all here, we just do the really simple thing: we allocate a chunk of contiguous memory, keep it in the capability, and use that to satisfy large block requests if it's large enough.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:10
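The "really simple thing" suggested above might look roughly like the following sketch: each capability holds one contiguous reserve and bumps a pointer through it, falling back to the global allocator only when the reserve is too small. Everything here is hypothetical. `LargeReserve`, `reserve_refill`, `alloc_large` and `global_alloc_blocks` are illustrative names, `malloc` stands in for the real block allocator, and a counter stands in for taking the global lock. Freeing and GC interaction are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE ((size_t)4096)   /* assumed block size in bytes */

/* Hypothetical per-capability reserve: one contiguous run of blocks. */
typedef struct {
    char  *next;         /* first unused byte of the reserve */
    size_t blocks_left;  /* whole blocks still available     */
} LargeReserve;

/* Counts how often the (simulated) global lock would be taken. */
static unsigned long global_lock_acquisitions = 0;

/* Stand-in for the global block allocator. A real implementation
 * would acquire the global lock here; this sketch only counts. */
static void *global_alloc_blocks(size_t nblocks)
{
    global_lock_acquisitions++;
    return malloc(nblocks * BLOCK_SIZE);
}

/* Fill the reserve with `nblocks` contiguous blocks (one global call). */
static void reserve_refill(LargeReserve *r, size_t nblocks)
{
    r->next = global_alloc_blocks(nblocks);
    r->blocks_left = (r->next != NULL) ? nblocks : 0;
}

/* Serve a large request from the reserve when it is large enough,
 * otherwise fall back to the global allocator. */
static void *alloc_large(LargeReserve *r, size_t nblocks)
{
    assert(nblocks > 0);
    if (r->blocks_left >= nblocks) {
        void *p = r->next;
        r->next += nblocks * BLOCK_SIZE;
        r->blocks_left -= nblocks;
        return p;             /* no global lock on this path */
    }
    return global_alloc_blocks(nblocks);
}
```

With a 16-block reserve, a run of single-block recv buffers would touch the global allocator once per refill instead of once per buffer. The sketch sidesteps the coalescing problem only because it never returns blocks to the reserve.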

Changes (by ihameed):

 * cc: idhameed@… (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:11

Comment (by carter):

 With the new contiguous heap design for x86_64 systems that just got merged in, do some of the ideas here become easier?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:12

Comment (by ezyang):

 carter: contiguous heap has not been merged in, and it doesn't really help for this problem.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:13

Milestone: 7.12.1

Comment (by tibbe):

 What about the idea of just using malloc? Modern mallocs like TCMalloc are already multithreaded and seem to just deal with all the annoying issues. Gregory Collins said that in Snap they just don't use the "built-in" ByteString construction functions and instead just call malloc.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:15

Comment (by simonmar):

 Malloc is fine for ByteStrings, but we can't use it for heap-resident objects due to the way block descriptors work. Our memory is always MB-aligned, so that we can put the block descriptors at the beginning of the MB. Also the GC has to be able to distinguish heap memory from non-heap memory, and we currently take advantage of the fact that memory is allocated in MB chunks to reduce the granularity that we have to map the address space.

 The contiguous-heap patch solves this in a different way (that is also incompatible with malloc).

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:16
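The alignment argument can be illustrated with address arithmetic: if every megablock is MB-aligned and keeps its block descriptors at its start, the descriptor for any heap pointer is found by masking and shifting, with no lookup table. The sketch below assumes 4 KB blocks, 1 MB megablocks and 64-byte descriptors, values modelled on the 64-bit RTS defaults. The helper names are hypothetical and the code works on plain integers rather than real heap memory.

```c
#include <stdint.h>

/* Assumed layout constants; verify against the RTS headers. */
#define BLOCK_SHIFT   12   /* 4 KB blocks               */
#define MBLOCK_SHIFT  20   /* 1 MB megablocks           */
#define BDESCR_SHIFT  6    /* 64-byte block descriptors */

#define MBLOCK_MASK  (((uintptr_t)1 << MBLOCK_SHIFT) - 1)

/* Start of the megablock containing address p. */
static uintptr_t mblock_base(uintptr_t p)
{
    return p & ~MBLOCK_MASK;
}

/* Index of the block within its megablock. */
static uintptr_t block_index(uintptr_t p)
{
    return (p & MBLOCK_MASK) >> BLOCK_SHIFT;
}

/* Address of the descriptor for the block containing p: descriptors
 * are laid out as an array at the start of the megablock. */
static uintptr_t bdescr_addr(uintptr_t p)
{
    return mblock_base(p) + (block_index(p) << BDESCR_SHIFT);
}
```

Memory returned by a general-purpose malloc carries no such alignment guarantee, so the same masking would land on arbitrary bytes rather than a descriptor, which is one way to read the objection above.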

Milestone: (none)

Comment (by dobenour):

 Ask the TCMalloc or jemalloc developers? They have solved this problem, and even if GHC can't use them directly, the algorithms used in them could be used.

 Also, I am wondering if the current large object limit is too small.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8732#comment:19