Re: [GHC] #10678: integer-gmp's runS seems unnecessarily expensive

3 Nov 2015

      #10678: integer-gmp's runS seems unnecessarily expensive
-------------------------------------+-------------------------------------
        Reporter:  rwbarton          |                Owner:
            Type:  bug               |               Status:  patch
        Priority:  normal            |            Milestone:  8.0.1
       Component:  Compiler          |              Version:  7.10.1
  (CodeGen)                          |
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):  Phab:D1103
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Changes (by simonmar):

 * cc: simonmar (added)

Old description:
...
integer-gmp uses an unsafePerformIO-like operation to work with mutable
BigNats (unsafePerformIO and even the IO type are not yet available,
since integer-gmp is a dependency of base):
{{{#!hs
type S s a = State# s -> (# State# s, a #)
-- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there
runS :: S RealWorld a -> a
runS m = lazy (case m realWorld# of (# _, r #) -> r)
{-# NOINLINE runS #-}
}}}
It's tempting to think of such an operation as "free" like an
unsafeCoerce, but it is actually somewhat expensive.
Consider `plusBigNat` for instance. (Most BigNat operations have a
similar structure.)
{{{#!hs
plusBigNat :: BigNat -> BigNat -> BigNat
plusBigNat x y
  | isTrue# (eqBigNatWord# x 0##) = y
  | isTrue# (eqBigNatWord# y 0##) = x
  | isTrue# (nx# >=# ny#) = go x nx# y ny#
  | True                  = go y ny# x nx#
  where
    go (BN# a#) na# (BN# b#) nb# = runS $ do
        mbn@(MBN# mba#) <- newBigNat# na#
        (W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#)
        case c# of
              0## -> unsafeFreezeBigNat# mbn
              _   -> unsafeSnocFreezeBigNat# mbn c#
nx# = sizeofBigNat# x
    ny# = sizeofBigNat# y
}}}
The assembly for `go` begins
{{{
00000000000001d0 :
 1d0:   49 83 c4 28             add    $0x28,%r12
 1d4:   4d 3b a5 58 03 00 00    cmp    0x358(%r13),%r12
 1db:   77 26                   ja     203

 1dd:   49 c7 44 24 e0 00 00    movq   $0x0,-0x20(%r12)
 1e4:   00 00
                        1e2: R_X86_64_32S       .text+0x38
 1e6:   4d 89 74 24 e8          mov    %r14,-0x18(%r12)
 1eb:   49 89 7c 24 f0          mov    %rdi,-0x10(%r12)
 1f0:   49 89 74 24 f8          mov    %rsi,-0x8(%r12)
 1f5:   4d 89 04 24             mov    %r8,(%r12)
 1f9:   4d 8d 74 24 e1          lea    -0x1f(%r12),%r14
 1fe:   e9 00 00 00 00          jmpq   203

                        1ff: R_X86_64_PC32
integerzmgmp_GHCziIntegerziType_runS_info-0x4
 203:   ...                     ; heap overflow
}}}
This allocates a 5-word closure (containing `a#`, `na#`, `b#`, `nb#`)
whose code is at `.text+0x38` and passes it to `runS`, which does some
`stg_ap`-y things to call back into the closure, which reads its free
variables back from the heap and finally does all the real work.
Altogether it's around two dozen instructions compared to if we could
call directly from `go` to the argument of `runS`.
The old integer-gmp somehow avoided this particular overhead by instead
using the implicit "unsafePerformIO" of a foreign import prim which
performed both the allocation and the addition. Is this overhead a
necessary consequence of doing the work in multiple steps in Haskell?
I understand that we cannot allow everything to be inlined and, for
example, the `newBigNat#` to be shared between a `plusBigNat` and
`minusBigNat` with the same arguments. But once `runS` has done its job
of keeping the `newBigNat#/c_mpn_add/unsafeFreeze*` together, it would be
nice to eliminate it completely in the backend when compiling `go`, or
any inlined version of `go`.
I'm not sure whether this should be fixed in the code generator or in
integer-gmp itself. I'm also aware that this is a tricky subject but
haven't really done my homework on the related tickets, so I might be
missing something important!
New description:

 integer-gmp uses an unsafePerformIO-like operation to work with mutable
 BigNats (unsafePerformIO and even the IO type are not yet available, since
 integer-gmp is a dependency of base):
 {{{#!hs
 type S s a = State# s -> (# State# s, a #)
 -- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there
 runS :: S RealWorld a -> a
 runS m = lazy (case m realWorld# of (# _, r #) -> r)
 {-# NOINLINE runS #-}
 }}}
 It's tempting to think of such an operation as "free" like an
 unsafeCoerce, but it is actually somewhat expensive.
 Consider `plusBigNat` for instance. (Most BigNat operations have a similar
 structure.)
 {{{#!hs
 plusBigNat :: BigNat -> BigNat -> BigNat
 plusBigNat x y
 | isTrue# (eqBigNatWord# x 0##) = y
 | isTrue# (eqBigNatWord# y 0##) = x
 | isTrue# (nx# >=# ny#) = go x nx# y ny#
 | True = go y ny# x nx#
 where
 go (BN# a#) na# (BN# b#) nb# = runS $ do
 mbn@(MBN# mba#) <- newBigNat# na#
 (W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#)
 case c# of
 0## -> unsafeFreezeBigNat# mbn
 _ -> unsafeSnocFreezeBigNat# mbn c#
 nx# = sizeofBigNat# x
 ny# = sizeofBigNat# y
 }}}
 The assembly for `go` begins
 {{{
 00000000000001d0 :
 1d0: 49 83 c4 28 add $0x28,%r12
 1d4: 4d 3b a5 58 03 00 00 cmp 0x358(%r13),%r12
 1db: 77 26 ja 203 
 1dd: 49 c7 44 24 e0 00 00 movq $0x0,-0x20(%r12)
 1e4: 00 00
 1e2: R_X86_64_32S .text+0x38
 1e6: 4d 89 74 24 e8 mov %r14,-0x18(%r12)
 1eb: 49 89 7c 24 f0 mov %rdi,-0x10(%r12)
 1f0: 49 89 74 24 f8 mov %rsi,-0x8(%r12)
 1f5: 4d 89 04 24 mov %r8,(%r12)
 1f9: 4d 8d 74 24 e1 lea -0x1f(%r12),%r14
 1fe: e9 00 00 00 00 jmpq 203

 1ff: R_X86_64_PC32 integerzmgmp_GHCziIntegerziType_runS_info-0x4
 203: ... ; heap overflow
 }}}
 This allocates a 5-word closure (containing `a#`, `na#`, `b#`, `nb#`)
 whose code is at `.text+0x38` and passes it to `runS`, which does some
 `stg_ap`-y things to call back into the closure, which reads its free
 variables back from the heap and finally does all the real work.
 Altogether it's around two dozen instructions compared to if we could call
 directly from `go` to the argument of `runS`.
 The old integer-gmp somehow avoided this particular overhead by instead
 using the implicit "unsafePerformIO" of a foreign import prim which
 performed both the allocation and the addition. Is this overhead a
 necessary consequence of doing the work in multiple steps in Haskell?
 I understand that we cannot allow everything to be inlined and, for
 example, the `newBigNat#` to be shared between a `plusBigNat` and
 `minusBigNat` with the same arguments. But once `runS` has done its job of
 keeping the `newBigNat#/c_mpn_add/unsafeFreeze*` together, it would be
 nice to eliminate it completely in the backend when compiling `go`, or any
 inlined version of `go`.
 I'm not sure whether this should be fixed in the code generator or in
 integer-gmp itself. I'm also aware that this is a tricky subject but
 haven't really done my homework on the related tickets, so I might be
 missing something important!

--

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10678#comment:15
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler