[GHC] #13763: Performance regression (~34%) in 8.2.1, poor register allocation(?) in an inner loop over an array

28 May 2017

      #13763: Performance regression (~34%) in 8.2.1, poor register allocation(?) in an
inner loop over an array
-------------------------------------+-------------------------------------
           Reporter:  jberryman      |             Owner:  (none)
               Type:  bug            |            Status:  new
           Priority:  normal         |         Milestone:
          Component:  Compiler       |           Version:  8.2.1-rc2
  (NCG)                              |
           Keywords:                 |  Operating System:  Unknown/Multiple
       Architecture:                 |   Type of failure:  Runtime
  Unknown/Multiple                   |  performance bug
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 Testing GHC 8.0.1 against the RC 8.2.0.20170507

 I've distilled a smallish test-case from a much larger case in my
 'hashabler' library, and validated that the same modifications also make
 that regression disappear in the real case. It's probably possible to get
 this smaller but I don't know if I'll have time to work on it more for a
 while:

 repro3.hs:

 {{{#!hs
 {-# LANGUAGE BangPatterns #-}
 module Main(main) where

 import Criterion.Main
 import qualified Data.Primitive as P

 import Data.Bits
 import Data.Word
 import Control.DeepSeq

 main :: IO ()
 main = do
   defaultMain
     [ env (newByteArr64 5) $ \ ~bs ->
         bench "ByteArray 5" $ nf (hashTextSip 99) bs
     , env (newByteArr64 8) $ \ ~bs ->
         bench "ByteArray 8" $ nf (hashTextSip 99) bs
     , env (newByteArr64 512) $ \ ~bs ->
         bench "ByteArray 512" $ nf (hashTextSip 99) bs
     , env (newByteArr64 1000000) $ \ ~bs ->
         bench "ByteArray 1000000" $ nf (hashTextSip 99) bs
     ]

 instance NFData P.ByteArray where rnf _ = ()

 newByteArr64 n = P.newAlignedPinnedByteArray (8*n) 8 >>=
 P.unsafeFreezeByteArray

 sipRound :: Word64 -> Word64 -> Word64 -> Word64 -> (Word64, Word64,
 Word64, Word64)
 {-# INLINE sipRound #-}
 sipRound v0 v1 v2 v3 = (v3 `xor` v0, v0 `xor` v1, v1 `xor` v2, v2 `xor`
 v3)

 hashTextSip :: Word64 -> P.ByteArray -> Word64
 {-# INLINE hashTextSip #-}
 hashTextSip h = \ ba ->
     let !lenWord16 = P.sizeofByteArray ba `unsafeShiftR` 1
         !word16sRem = lenWord16 .&. 3
         !word16sIx = lenWord16-word16sRem
         !ixFinal = lenWord16-1
         !word16sIxWd = word16sIx `unsafeShiftR` 2  -- `div` 4

         hash4Word16sLoop hAcc@(!w0,!w1,!w2,!w3) !ix
             | ix == word16sIxWd = hashRemainingWord16s hAcc word16sIx
             | otherwise     =
                 let w64Dirty = P.indexByteArray ba ix
                     w64 = clean4xWord16ChunkLE w64Dirty
                  in hash4Word16sLoop (sipRound (w0 `xor` w64) w1 w2 w3)
 (ix + 1)

         -- NOTE: Removing this causes regression to disappear as well.
         hashRemainingWord16s (!w0,!w1,!w2,!w3) !ix
             | ix > ixFinal  = w0
             | otherwise     =
                 let w16 = P.indexByteArray ba ix
                  in hashRemainingWord16s (sipRound (w0 `xor` (fromIntegral
 (w16 :: Word16))) w1 w2 w3) (ix+1)
      in hash4Word16sLoop (h,1,2,3) 0

 clean4xWord16ChunkLE :: Word64 -> Word64
 {-# INLINE clean4xWord16ChunkLE #-}
 clean4xWord16ChunkLE w64Dirty =
     -- NOTE: no regression when just this (8.2 is faster)
     -- (((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)

     -- ...but this is a big regression:
     (((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)
     .|.
     (((byteSwap64 w64Dirty) `unsafeShiftL` 8) .&. 0xFF00FF00FF00FF00)
 }}}

 Here are the results of the benchmark above on my machine:

 On GHC **8.0.1**:
 {{{
 benchmarking ByteArray 5
 time                 24.70 ns   (24.00 ns .. 26.25 ns)
                      0.987 R²   (0.967 R² .. 1.000 R²)
 mean                 24.44 ns   (24.13 ns .. 25.80 ns)
 std dev              1.859 ns   (318.3 ps .. 4.227 ns)
 variance introduced by outliers: 86% (severely inflated)

 benchmarking ByteArray 8
 time                 32.66 ns   (32.58 ns .. 32.76 ns)
                      1.000 R²   (1.000 R² .. 1.000 R²)
 mean                 32.79 ns   (32.64 ns .. 33.09 ns)
 std dev              683.7 ps   (365.4 ps .. 1.175 ns)
 variance introduced by outliers: 31% (moderately inflated)

 benchmarking ByteArray 512
 time                 1.428 μs   (1.382 μs .. 1.522 μs)
                      0.986 R²   (0.970 R² .. 1.000 R²)
 mean                 1.398 μs   (1.384 μs .. 1.454 μs)
 std dev              91.12 ns   (4.475 ns .. 193.9 ns)
 variance introduced by outliers: 76% (severely inflated)

 benchmarking ByteArray 1000000
 time                 2.658 ms   (2.653 ms .. 2.663 ms)
                      1.000 R²   (1.000 R² .. 1.000 R²)
 mean                 2.672 ms   (2.665 ms .. 2.691 ms)
 std dev              35.00 μs   (10.88 μs .. 59.58 μs)

 }}}

 And on **GHC 8.2** RC:
 {{{
 benchmarking ByteArray 5
 time                 23.78 ns   (23.68 ns .. 23.88 ns)
                      1.000 R²   (1.000 R² .. 1.000 R²)
 mean                 23.83 ns   (23.76 ns .. 23.95 ns)
 std dev              298.8 ps   (183.2 ps .. 482.5 ps)
 variance introduced by outliers: 14% (moderately inflated)

 benchmarking ByteArray 8
 time                 35.81 ns   (35.44 ns .. 36.27 ns)
                      0.999 R²   (0.998 R² .. 1.000 R²)
 mean                 35.56 ns   (35.45 ns .. 35.94 ns)
 std dev              596.8 ps   (134.5 ps .. 1.184 ns)
 variance introduced by outliers: 22% (moderately inflated)

 benchmarking ByteArray 512
 time                 1.706 μs   (1.698 μs .. 1.716 μs)
                      1.000 R²   (1.000 R² .. 1.000 R²)
 mean                 1.701 μs   (1.698 μs .. 1.707 μs)
 std dev              13.27 ns   (5.825 ns .. 24.41 ns)

 benchmarking ByteArray 1000000
 time                 3.322 ms   (3.284 ms .. 3.377 ms)
                      0.999 R²   (0.998 R² .. 1.000 R²)
 mean                 3.296 ms   (3.287 ms .. 3.332 ms)
 std dev              44.62 μs   (20.55 μs .. 87.29 μs)

 }}}

 Looking at the core wasn't fruitful, but I think dumping the asm shows
 that this is a case of bad (or worse) register allocation. I've attached
 two screenshots showing the instructions added (in blue), when moving from
 the one-line `clean4xWord16ChunkLE` to the two-line version, for both 8.0
 and 8.2 (there wasn't anything in the diff besides instances of this
 change).

 It looks in the 8.2 version like we've decided we're out of registers and
 need to use the stack.

 In my real code I'm seeing 35% regression on very long Text, as well as
 21% regression on very long ByteString; the latter implementation is
 similarly structured to `hashTextSip`, but doesn't call
 `clean4xWord16ChunkLE` but does do a byteswap.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13763
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

tags

participants (1)