[GHC] #10678: integer-gmp's runS seems unnecessarily expensive

#10678: integer-gmp's runS seems unnecessarily expensive
-------------------------------------+-------------------------------------
Reporter: rwbarton | Owner:
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.10.1
(CodeGen) |
Keywords: | Operating System: Unknown/Multiple
Architecture: | Type of failure: Runtime
Unknown/Multiple | performance bug
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Revisions: |
-------------------------------------+-------------------------------------
integer-gmp uses an unsafePerformIO-like operation to work with mutable
BigNats (unsafePerformIO and even the IO type are not yet available, since
integer-gmp is a dependency of base):
{{{
type S s a = State# s -> (# State# s, a #)
-- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there
runS :: S RealWorld a -> a
runS m = lazy (case m realWorld# of (# _, r #) -> r)
{-# NOINLINE runS #-}
}}}
It's tempting to think of such an operation as "free" like an
unsafeCoerce, but it is actually somewhat expensive.
Consider `plusBigNat` for instance. (Most BigNat operations have a similar
structure.)
{{{
plusBigNat :: BigNat -> BigNat -> BigNat
plusBigNat x y
| isTrue# (eqBigNatWord# x 0##) = y
| isTrue# (eqBigNatWord# y 0##) = x
| isTrue# (nx# >=# ny#) = go x nx# y ny#
| True = go y ny# x nx#
where
go (BN# a#) na# (BN# b#) nb# = runS $ do
mbn@(MBN# mba#) <- newBigNat# na#
(W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#)
case c# of
0## -> unsafeFreezeBigNat# mbn
_ -> unsafeSnocFreezeBigNat# mbn c#
nx# = sizeofBigNat# x
ny# = sizeofBigNat# y
}}}
The assembly for `go` begins
{{{
00000000000001d0

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by rwbarton): I note that bytestring and text have the same issue (using `unsafeDupablePerformIO` and `runSTRep` respectively in the role of `runS`) so this may be a known problem. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonpj): Reid, indeed it is known: #5916. (If you find any other tickets about this, please add them.) I think the best approach would be as in comment:16 of ticket:5916#comment:16 * '''Inline `runS` very late, in `CorePrep`''' Some thoughts about this. * We'd want a single, magical function which has this behaviour. Things like `unsafeDupablePerformIO` should call `runS`. * Now I think about it, I wonder if the magical function should instead be this: {{{ runRW# :: (State# s -> (# State s, a)) -> (# State s, a #) {-# NOINLINE runRW# #-} runRW# f = f realWorld# }}} That is, all `runRW#` does is to apply its argument to a fresh state token. Now we can define your `runS` thus: {{{ runS f = case ruNRW# f of (# _, r #) -> r }}} I think we can inline `runS` freely, which is good because it means that more code is exposed to the optimiser (in particular that `case`). * Moreover, if we do this, I think we now don't need `lazy`. Because `runRW#`'s strictness signature won't be strict in the `r` part; see `Note [unsafeDupablePerformIO has a lazy RHS]` in `GHC.IO`. * If `CorePrep` saw `runRW# (f a b)` it can generate `f a b realWorld#`. But if it sees `runRW# (\s.e)`, it should generate `e[realWorld#/s]`. That is, it should do the beta-reduction on the fly. That's slightly annoying because `CorePrep` doesn't currently carry around a substitution. But I suppose that you could literally call `substExpr`; this doesn't happen much. * Alternatively maybe this could be done in the code generator, effectively treating `runRW` as a primop. That would be a good plan, except that by the time we get to STG the program is in ANF, so instead of `runRW# (f a b)` we'd have {{{ let g = f a b in ruNRW# g }}} and before we see the `runRW#` we'll have generated code to allocate a closure for `g`. So we'd need to allow STG syntax (for specified primops) to have expressions, not just atoms, as the arguments to the primop. This might actually be a Jolly Good Thing. At the moment `catch# (\s -> e1) (\x s -> e2)` allocates two closures before it gets to the `catch#`, which is pretty stupid since the first thing we do after pushing the exception-catch frame is to execute `e1`. I think changes like this would make modest but significant improvements to many programs. If you'd like to dig into it, I'd gladly help. Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by rwbarton): I am working on a patch that adds such a `runRW#` function with special treatment in CorePrep, since it seems easy and most of the work could be reused for the special primop approach anyways. No benchmarks yet but it does produce the expected Cmm without the floating-out problems of `inlinePerformIO` in my simple test, and it passes validate (which might not greatly stress this case, but the new `runRW#` is being used from integer-gmp and I guess also from bytestring via `unsafeDupablePerformIO`). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonpj): Great! Thank you. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by rwbarton): Results are mildly promising so far but I ran into an unexpected snag related to CPR analysis that is undoing some of the gains. See #10694. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by rwbarton): * Attachment "runRW-nofib.txt" added. nofib results -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by rwbarton): With #10694 fixed I am pretty happy with the results so far. Allocations are down for Integer and ByteString users as expected. Module size, compile allocations and compile time are also slightly down on average. The variation in program runtime seems to be due to a combination of noise and #8279. I suspect real world modern Haskell programs may gain more on average due to preferring ByteString and Text over String. The only programs in nofib which use ByteString (none use Text) are three shootout programs that have been highly optimized by hand. Maybe I should try fibon? For a microbenchmark {{{ f :: B.ByteString -> B.ByteString f s = case B.uncons s of Just (c, s') -> B.snoc s' c Nothing -> B.empty }}} allocations are down from 136 bytes to 96 bytes and runtime from 16ns to 13ns (when `s` is a 9-byte string). I got roughly similar results from an integer-gmp benchmark (repeatedly adding 1 to a large Integer). There is more room for improvement, though. Both these microbenchmarks allocate a boxed heap value inside the `runRW#`, only to immediately unbox it outside the `runRW#`. Some kind of CPR analysis + w/w-type transformation could eliminate these intermediate allocations. I implemented this transformation manually in bytestring and it shaved off another 40 bytes of allocation (indeed the size of a `ByteString` heap object) and ~10% of the runtime. However it would be much nicer for GHC to do this automatically. Need to think more on the best way to accomplish this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonpj): Good! Can you give a small example of your last para? I bet it's something like {{{ case (runRW (\s -> I# x)) of (# _ , I# y) -> blah }}} I bet that you can make something similar happen with `catch#`. A new ticket, perhaps. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by hvr): * cc: bgamari (added) Comment: Interesting, there's some (albeit not perfect) overlap between the improved nofib cases with the regressed nofib cases noted in CompilerPerformance#Comparinginteger-gmp0.5and1.0 E.g. {{{ bernouilli -0.3% -8.2% -2.4% -2.4% 0.0% fasta -0.3% -5.1% -0.1% -0.1% 0.0% kahan -0.2% -11.4% +0.3% +0.3% 0.0% primetest -0.3% -18.5% -3.5% -3.6% 0.0% rsa -0.3% -19.8% 0.060 0.060 0.0% scs -0.4% -3.2% -2.0% -2.0% +14.3% symalg -0.3% -2.1% 0.024 0.024 0.0% wave4main -0.2% -4.1% -2.1% -2.1% -5.3% }}} vs ([https://gist.githubusercontent.com/bgamari/5de75ac998a346b70ce8/raw/ffc2bd4f... /nofib-integer-gmp-comparison.txt source]) {{{ Program Size Allocs Runtime Elapsed TotalMem bernouilli -68.7% +15.3% 0.092 0.092 0.0% fasta -76.4% -0.1% -0.6% -0.7% 0.0% kahan -74.0% +39.9% 0.131 0.131 0.0% primetest -67.5% +49.9% 0.050 0.050 0.0% rsa -67.9% +53.4% 0.012 0.012 0.0% scs -60.2% +6.8% +1.2% +1.2% 0.0% symalg -63.1% +9.5% 0.005 0.005 0.0% wave4main -69.5% +0.0% 0.109 0.109 0.0% }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by rwbarton): Replying to [comment:7 simonpj]:
Good! Can you give a small example of your last para? I bet it's something like {{{ case (runRW (\s -> I# x)) of (# _ , I# y) -> blah }}}
Right, so in integer-gmp, where I previously had {{{ runS :: S RealWorld a -> a -- type S s a = State# s -> (# State# s, a #) runS m = case runRW# m of (# _, a #) -> a }}} when `a = BigNat` (`data BigNat = BN# ByteArray#`), I can instead use {{{ runS_BigNat :: S RealWorld BigNat -> BigNat runS_BigNat m = case runRW# (\s -> case m s of (# s', BN# ba #) -> (# s', ba #)) of (# _, ba #) -> BN# ba }}} The idea is * `m` usually simplifies to something like {{{ \s0 -> case newByteArray# n s0 of (# s1, arr #) -> case someGmpFunction arr s1 of s2 -> case unsafeFreezeByteArray# arr s2 of (# s3, farr #) -> (# s3, BN# farr #) }}} so in the "inner case" in `runS_BigNat`, the constructor `BN#` will cancel * Now the caller of `runS_BigNat` (say, `plusBigNat`) has the CPR property because the `BN#` constructor is visible outside the `runRW#` * Its caller in turn (say, `plusInteger`) looks like {{{ plusInteger (Jp# x) (Jp# y) = Jp# (plusBigNat x y) }}} but `Jp#`'s argument is unboxed for size reasons {{{ data Integer = ... | Jp# {-# UNPACK #-} !BigNat | ... }}} so `Jp# (plusBigNat x y)` would have meant extracting the field from a `BN#` and repacking it into a `Jp#`. But by calling the unboxed worker we can avoid ever allocating an intermediate `BN#` constructor. Note that the transformation `runS -> runS_BigNat` is always valid for any `m`, even though it looks like a nested CPR situation. The reason is that the context of the `runRW#` is strict in the second argument anyways. To determine when the transformation is a good idea for a particular argument `m`, we can use regular CPR analysis on the function `\s -> case m s of (# _, r #) -> r`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonpj): Interesting. DO do this less manually we would need two things: * The expression `runRW e` has the CPR property if `e rw` has the nested CPR property. * That might lead us to w/w a function like `plusBigNat` which in turn would lead to expressions like {{{ case (runRW e) of BN# farr -> farr }}} we would need some kind of special case to push the `case` inside the `runRW` to get {{{ runRW (\s -> case e s of (# s', r #) -> case r of BN# farr -> (# s', farr #) }}} That does not look too hard. It'd be good to resurrect nested CPR (there's a ticket for that, #1600). Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonpj): Anyway let's land `runRW`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * milestone: => 8.0.1 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: patch Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D1103 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: new => patch * differential: => Phab:D1103 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: patch Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D1103 Wiki Page: | -------------------------------------+------------------------------------- Description changed by bgamari: Old description:
integer-gmp uses an unsafePerformIO-like operation to work with mutable BigNats (unsafePerformIO and even the IO type are not yet available, since integer-gmp is a dependency of base): {{{ type S s a = State# s -> (# State# s, a #)
-- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there runS :: S RealWorld a -> a runS m = lazy (case m realWorld# of (# _, r #) -> r) {-# NOINLINE runS #-} }}} It's tempting to think of such an operation as "free" like an unsafeCoerce, but it is actually somewhat expensive.
Consider `plusBigNat` for instance. (Most BigNat operations have a similar structure.) {{{ plusBigNat :: BigNat -> BigNat -> BigNat plusBigNat x y | isTrue# (eqBigNatWord# x 0##) = y | isTrue# (eqBigNatWord# y 0##) = x | isTrue# (nx# >=# ny#) = go x nx# y ny# | True = go y ny# x nx# where go (BN# a#) na# (BN# b#) nb# = runS $ do mbn@(MBN# mba#) <- newBigNat# na# (W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#) case c# of 0## -> unsafeFreezeBigNat# mbn _ -> unsafeSnocFreezeBigNat# mbn c#
nx# = sizeofBigNat# x ny# = sizeofBigNat# y }}} The assembly for `go` begins {{{ 00000000000001d0
: 1d0: 49 83 c4 28 add $0x28,%r12 1d4: 4d 3b a5 58 03 00 00 cmp 0x358(%r13),%r12 1db: 77 26 ja 203 1dd: 49 c7 44 24 e0 00 00 movq $0x0,-0x20(%r12) 1e4: 00 00 1e2: R_X86_64_32S .text+0x38 1e6: 4d 89 74 24 e8 mov %r14,-0x18(%r12) 1eb: 49 89 7c 24 f0 mov %rdi,-0x10(%r12) 1f0: 49 89 74 24 f8 mov %rsi,-0x8(%r12) 1f5: 4d 89 04 24 mov %r8,(%r12) 1f9: 4d 8d 74 24 e1 lea -0x1f(%r12),%r14 1fe: e9 00 00 00 00 jmpq 203 1ff: R_X86_64_PC32 integerzmgmp_GHCziIntegerziType_runS_info-0x4 203: ... ; heap overflow }}} This allocates a 5-word closure (containing `a#`, `na#`, `b#`, `nb#`) whose code is at `.text+0x38` and passes it to `runS`, which does some `stg_ap`-y things to call back into the closure, which reads its free variables back from the heap and finally does all the real work. Altogether it's around two dozen instructions compared to if we could call directly from `go` to the argument of `runS`. The old integer-gmp somehow avoided this particular overhead by instead using the implicit "unsafePerformIO" of a foreign import prim which performed both the allocation and the addition. Is this overhead a necessary consequence of doing the work in multiple steps in Haskell?
I understand that we cannot allow everything to be inlined and, for example, the `newBigNat#` to be shared between a `plusBigNat` and `minusBigNat` with the same arguments. But once `runS` has done its job of keeping the `newBigNat#/c_mpn_add/unsafeFreeze*` together, it would be nice to eliminate it completely in the backend when compiling `go`, or any inlined version of `go`.
I'm not sure whether this should be fixed in the code generator or in integer-gmp itself. I'm also aware that this is a tricky subject but haven't really done my homework on the related tickets, so I might be missing something important!
New description:
integer-gmp uses an unsafePerformIO-like operation to work with mutable
BigNats (unsafePerformIO and even the IO type are not yet available, since
integer-gmp is a dependency of base):
{{{#!hs
type S s a = State# s -> (# State# s, a #)
-- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there
runS :: S RealWorld a -> a
runS m = lazy (case m realWorld# of (# _, r #) -> r)
{-# NOINLINE runS #-}
}}}
It's tempting to think of such an operation as "free" like an
unsafeCoerce, but it is actually somewhat expensive.
Consider `plusBigNat` for instance. (Most BigNat operations have a similar
structure.)
{{{#!hs
plusBigNat :: BigNat -> BigNat -> BigNat
plusBigNat x y
| isTrue# (eqBigNatWord# x 0##) = y
| isTrue# (eqBigNatWord# y 0##) = x
| isTrue# (nx# >=# ny#) = go x nx# y ny#
| True = go y ny# x nx#
where
go (BN# a#) na# (BN# b#) nb# = runS $ do
mbn@(MBN# mba#) <- newBigNat# na#
(W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#)
case c# of
0## -> unsafeFreezeBigNat# mbn
_ -> unsafeSnocFreezeBigNat# mbn c#
nx# = sizeofBigNat# x
ny# = sizeofBigNat# y
}}}
The assembly for `go` begins
{{{
00000000000001d0

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: patch Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D1103 Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonmar): * cc: simonmar (added) Old description:
integer-gmp uses an unsafePerformIO-like operation to work with mutable BigNats (unsafePerformIO and even the IO type are not yet available, since integer-gmp is a dependency of base): {{{#!hs type S s a = State# s -> (# State# s, a #)
-- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there runS :: S RealWorld a -> a runS m = lazy (case m realWorld# of (# _, r #) -> r) {-# NOINLINE runS #-} }}} It's tempting to think of such an operation as "free" like an unsafeCoerce, but it is actually somewhat expensive.
Consider `plusBigNat` for instance. (Most BigNat operations have a similar structure.) {{{#!hs plusBigNat :: BigNat -> BigNat -> BigNat plusBigNat x y | isTrue# (eqBigNatWord# x 0##) = y | isTrue# (eqBigNatWord# y 0##) = x | isTrue# (nx# >=# ny#) = go x nx# y ny# | True = go y ny# x nx# where go (BN# a#) na# (BN# b#) nb# = runS $ do mbn@(MBN# mba#) <- newBigNat# na# (W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#) case c# of 0## -> unsafeFreezeBigNat# mbn _ -> unsafeSnocFreezeBigNat# mbn c#
nx# = sizeofBigNat# x ny# = sizeofBigNat# y }}} The assembly for `go` begins {{{ 00000000000001d0
: 1d0: 49 83 c4 28 add $0x28,%r12 1d4: 4d 3b a5 58 03 00 00 cmp 0x358(%r13),%r12 1db: 77 26 ja 203 1dd: 49 c7 44 24 e0 00 00 movq $0x0,-0x20(%r12) 1e4: 00 00 1e2: R_X86_64_32S .text+0x38 1e6: 4d 89 74 24 e8 mov %r14,-0x18(%r12) 1eb: 49 89 7c 24 f0 mov %rdi,-0x10(%r12) 1f0: 49 89 74 24 f8 mov %rsi,-0x8(%r12) 1f5: 4d 89 04 24 mov %r8,(%r12) 1f9: 4d 8d 74 24 e1 lea -0x1f(%r12),%r14 1fe: e9 00 00 00 00 jmpq 203 1ff: R_X86_64_PC32 integerzmgmp_GHCziIntegerziType_runS_info-0x4 203: ... ; heap overflow }}} This allocates a 5-word closure (containing `a#`, `na#`, `b#`, `nb#`) whose code is at `.text+0x38` and passes it to `runS`, which does some `stg_ap`-y things to call back into the closure, which reads its free variables back from the heap and finally does all the real work. Altogether it's around two dozen instructions compared to if we could call directly from `go` to the argument of `runS`. The old integer-gmp somehow avoided this particular overhead by instead using the implicit "unsafePerformIO" of a foreign import prim which performed both the allocation and the addition. Is this overhead a necessary consequence of doing the work in multiple steps in Haskell?
I understand that we cannot allow everything to be inlined and, for example, the `newBigNat#` to be shared between a `plusBigNat` and `minusBigNat` with the same arguments. But once `runS` has done its job of keeping the `newBigNat#/c_mpn_add/unsafeFreeze*` together, it would be nice to eliminate it completely in the backend when compiling `go`, or any inlined version of `go`.
I'm not sure whether this should be fixed in the code generator or in integer-gmp itself. I'm also aware that this is a tricky subject but haven't really done my homework on the related tickets, so I might be missing something important!
New description:
integer-gmp uses an unsafePerformIO-like operation to work with mutable
BigNats (unsafePerformIO and even the IO type are not yet available, since
integer-gmp is a dependency of base):
{{{#!hs
type S s a = State# s -> (# State# s, a #)
-- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there
runS :: S RealWorld a -> a
runS m = lazy (case m realWorld# of (# _, r #) -> r)
{-# NOINLINE runS #-}
}}}
It's tempting to think of such an operation as "free" like an
unsafeCoerce, but it is actually somewhat expensive.
Consider `plusBigNat` for instance. (Most BigNat operations have a similar
structure.)
{{{#!hs
plusBigNat :: BigNat -> BigNat -> BigNat
plusBigNat x y
| isTrue# (eqBigNatWord# x 0##) = y
| isTrue# (eqBigNatWord# y 0##) = x
| isTrue# (nx# >=# ny#) = go x nx# y ny#
| True = go y ny# x nx#
where
go (BN# a#) na# (BN# b#) nb# = runS $ do
mbn@(MBN# mba#) <- newBigNat# na#
(W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#)
case c# of
0## -> unsafeFreezeBigNat# mbn
_ -> unsafeSnocFreezeBigNat# mbn c#
nx# = sizeofBigNat# x
ny# = sizeofBigNat# y
}}}
The assembly for `go` begins
{{{
00000000000001d0

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: patch Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D1103 Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonmar): * cc: simonmar (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive
-------------------------------------+-------------------------------------
Reporter: rwbarton | Owner:
Type: bug | Status: patch
Priority: normal | Milestone: 8.0.1
Component: Compiler | Version: 7.10.1
(CodeGen) |
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D1103
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: closed Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D1103 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: patch => closed * resolution: => fixed -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10678: integer-gmp's runS seems unnecessarily expensive -------------------------------------+------------------------------------- Reporter: rwbarton | Owner: Type: bug | Status: closed Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.10.1 (CodeGen) | Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D1103 Wiki Page: | -------------------------------------+------------------------------------- Comment (by nomeata): https://perf.haskell.org/ghc/#revision/351de169e14ad9277aaca653df4a3753c151f... looks great, good job! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC