[GHC] #8885: Add inline versions of clone array primops

#8885: Add inline versions of clone array primops ------------------------------------+------------------------------------- Reporter: tibbe | Owner: simonmar Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Keywords: | Operating System: Unknown/Multiple Architecture: Unknown/Multiple | Type of failure: None/Unknown Difficulty: Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | ------------------------------------+------------------------------------- I've changed the clone array primops (i.e. `cloneArray#`, `cloneMutableArray#`, `freezeArray#`, and `thawArray#`) to use the new inline allocation optimization for statically known array sizes. Furthermore, I've moved the implementation for the non-statically known case out-of-line, which should reduce code size. The numbers are very encouraging, with the new implementation being 74% (i.e. almost 4x) faster than the old one. I measured this by looking at the total time reported by `+RTS -s` for the attached `InlineCloneArrayAlloc` benchmark. Here are the stats from the best out of three runs of the old implementation: {{{ 1,600,041,120 bytes allocated in the heap 6,504 bytes copied during GC 35,992 bytes maximum residency (1 sample(s)) 21,352 bytes maximum slop 1588 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 1 colls, 0 par 0.01s 0.01s 0.0082s 0.0082s Gen 1 1 colls, 0 par 0.00s 0.11s 0.1062s 0.1062s INIT time 0.00s ( 0.00s elapsed) MUT time 0.29s ( 0.57s elapsed) GC time 0.01s ( 0.11s elapsed) EXIT time 0.01s ( 0.11s elapsed) Total time 0.31s ( 0.80s elapsed) %GC time 2.7% (14.2% elapsed) Alloc rate 5,497,251,856 bytes per MUT second Productivity 97.3% of total user, 37.4% of total elapsed }}} Here are the same for the new implementation: {{{ 1,600,041,120 bytes allocated in the heap 57,224 bytes copied during GC 35,992 bytes maximum residency (1 sample(s)) 21,352 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 3125 colls, 0 par 0.01s 0.01s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s INIT time 0.00s ( 0.00s elapsed) MUT time 0.08s ( 0.08s elapsed) GC time 0.01s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.08s ( 0.09s elapsed) %GC time 6.4% (8.8% elapsed) Alloc rate 21,260,179,643 bytes per MUT second Productivity 93.5% of total user, 88.8% of total elapsed }}} The performance ratio between the new and old implementation gets worse for the old implementation as the iteration count is increased. There's also an interesting difference in the Gen 1 collection times between the two implementations. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Changes (by tibbe): * status: new => patch -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by tibbe): I did some investigation to figure out why the old implementation holds on to all memory (assuming that it's not an accounting error). First, here's what the inner loop looks like for the old implementation: {{{ $wa_entry() // [R3, R2] { [(c3i2, $wa_info: const 12884901901; const 0; const 15;)] } {offset c3i2: if (R2 != 0) goto c3i0; else goto c3i1; c3i0: _s3f9::P64 = R3; (_c3hS::I64) = call "ccall" arg hints: [PtrHint,] result hints: [PtrHint] allocate(BaseReg - 24, 20); I64[_c3hS::I64] = I64[PicBaseReg + stg_MUT_ARR_PTRS_FROZEN0_info@GOTPCREL]; I64[_c3hS::I64 + 8] = 16; I64[_c3hS::I64 + 16] = 17; _c3i8::I64 = _c3hS::I64 + 24; call MO_Memcpy(_c3i8::I64, _s3f9::P64 + 24, 128, 8); call MO_Memset(_c3i8::I64 + 128, 1, 1, 8); R3 = _s3f9::P64; R2 = R2 - 1; call $wa_info(R3, R2) args: 8, res: 0, upd: 8; c3i1: R1 = PicBaseReg + (()_closure+1); call (P64[Sp])(R1) args: 8, res: 0, upd: 8; } } }}} The first thing I did was to use the correct info table (`FROZEN`, not `FROZEN0`). That didn't help. The second thing I did was to use `-fno-omit-yields` to try to make sure the GC was run, as there's no heap check in this loop (and I don't know if `allocate` ever invokes the GC). That didn't have any effect. For reference, here's what the code with `-fno-omit-yields` looks like: {{{ $wa_entry() // [R3, R2] { [(c3i7, $wa_info: const 12884901901; const 0; const 15;)] } {offset c3i7: if (I64[BaseReg + 856] == 0) goto c3i8; else goto c3ia; c3i8: // nop // nop R1 = PicBaseReg + $wa_closure; call (I64[BaseReg - 8])(R3, R2, R1) args: 8, res: 0, upd: 8; c3ia: if (R2 != 0) goto c3i5; else goto c3i6; c3i5: _s3f9::P64 = R3; (_c3hX::I64) = call "ccall" arg hints: [PtrHint,] result hints: [PtrHint] allocate(BaseReg - 24, 20); I64[_c3hX::I64] = I64[PicBaseReg + stg_MUT_ARR_PTRS_FROZEN_info@GOTPCREL]; I64[_c3hX::I64 + 8] = 16; I64[_c3hX::I64 + 16] = 17; _c3ie::I64 = _c3hX::I64 + 24; call MO_Memcpy(_c3ie::I64, _s3f9::P64 + 24, 128, 8); call MO_Memset(_c3ie::I64 + 128, 1, 1, 8); R3 = _s3f9::P64; R2 = R2 - 1; call $wa_info(R3, R2) args: 8, res: 0, upd: 8; c3i6: R1 = PicBaseReg + (()_closure+1); call (P64[Sp])(R1) args: 8, res: 0, upd: 8; } } }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by tibbe): Both patches validate (module unrelated failures to type signature printing that's breaking validate in HEAD at the moment.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by simonmar):
1588 MB total memory in use (0 MB lost due to fragmentation)
Hehe, there was a bug in the old implementation, it didn't check for heap overflow so this benchmark never GC'd. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by simonmar): So we don't know how much of the speed increase is due to fixing the bug, and how much is due to inlining the array alloc. Still, it's probably better to inline the array alloc. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

So we don't know how much of the speed increase is due to fixing the bug, and how much is due to inlining the array alloc. Still, it's
#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by tibbe): Replying to [comment:5 simonmar]: probably better to inline the array alloc. Right. Is the bug in the implementation of `allocate` or in the implementation of the primops? If it's in the primops this bug exists in at least `stg_newArrayzh` as well, as I stole the allocation code from there. If you let me know when the bug is fixed, I will rerun the benchmark. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by simonmar): The bug is in the primop. The out-of-line implementation doesn't have the bug: look for `MAYBE_GC()`, that's the heap-overflow check. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by tibbe): There's not much point in comparing the new inline version vs the old, incorrect version. Fixing the old, incorrect version just to get the benchmark numbers doesn't seem worth it, as we'd have to replicate `MAYBE_GC` in `StgCmmPrim`. Instead I compare the new inline version against the new out-of-line version (which calls `allocate`). The inline versions is 69% faster. Here are the `+RTS -s` numbers for the new out-of-line version: {{{ 1,600,041,120 bytes allocated in the heap 57,992 bytes copied during GC 35,992 bytes maximum residency (1 sample(s)) 21,352 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 3173 colls, 0 par 0.01s 0.01s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.01s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.26s ( 0.26s elapsed) %GC time 2.1% (2.9% elapsed) Alloc rate 6,417,285,798 bytes per MUT second Productivity 97.9% of total user, 95.6% of total elapsed }}} And for the inline version: {{{ 1,600,041,120 bytes allocated in the heap 57,224 bytes copied during GC 35,992 bytes maximum residency (1 sample(s)) 21,352 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 3125 colls, 0 par 0.00s 0.01s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 0.08s ( 0.08s elapsed) GC time 0.00s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.08s ( 0.09s elapsed) %GC time 6.1% (8.2% elapsed) Alloc rate 20,999,017,271 bytes per MUT second Productivity 93.9% of total user, 89.2% of total elapsed }}} You can see that the GC issue has been fixed. I've attached updated versions of my patches that address the `MAYBE_GC` issue, which was also present in my new out-of-line implementation. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by tibbe): Marking these "out-of-line but may be inlined" primops as out-of-line in `primops.txt.pp` had the side effect of always adding an extra heap check for them, removing some of the benefits of inlining them in the first place. I've attached a patch that makes the heap check layout code respect the decision to inline these primops. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by tibbe): My intuition to use a pointer based copy loop instead of an index based loop seems to have been wrong. Changing the loop from: {{{ dst_p = dst + SIZEOF_StgMutArrPtrs; src_p = src + SIZEOF_StgMutArrPtrs + WDS(offset); while: if (n != 0) { n = n - 1; W_[dst_p] = W_[src_p]; dst_p = dst_p + WDS(1); src_p = src_p + WDS(1); goto while; } }}} to {{{ dst_p = dst + SIZEOF_StgMutArrPtrs; src_p = src + SIZEOF_StgMutArrPtrs + WDS(offset); i = 0; while: if (i < n) { W_[dst_p + i] = W_[src_p + i]; i = i + 1; goto while; } }}} improves performance on my benchmark (clone array of 17 elements) by 14%. Attached patch with this improvement. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops
-------------------------------------+------------------------------------
Reporter: tibbe | Owner: simonmar
Type: feature request | Status: patch
Priority: normal | Milestone:
Component: Compiler | Version: 7.9
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: None/Unknown | Difficulty: Unknown
Test Case: | Blocked By:
Blocking: | Related Tickets:
-------------------------------------+------------------------------------
Comment (by tibbe):
I've rebased the first three patches, validated them, and submitted them
as:
In [changeset:1eece45692fb5d1a5f4ec60c1537f8068237e9c1/ghc]:
{{{
commit 1eece45692fb5d1a5f4ec60c1537f8068237e9c1
Author: Johan Tibell

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: closed Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Changes (by tibbe): * status: patch => closed * resolution: => fixed -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8885: Add inline versions of clone array primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: simonmar Type: feature request | Status: closed Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by simonmar): Sorry I didn't get around to reviewing this. I just skimmed the commit and it all looks very reasonable though. Nice work :) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC