[GHC] #8885: Add inline versions of clone array primops

13 Mar 2014

      #8885: Add inline versions of clone array primops
------------------------------------+-------------------------------------
       Reporter:  tibbe             |             Owner:  simonmar
           Type:  feature request   |            Status:  new
       Priority:  normal            |         Milestone:
      Component:  Compiler          |           Version:  7.9
       Keywords:                    |  Operating System:  Unknown/Multiple
   Architecture:  Unknown/Multiple  |   Type of failure:  None/Unknown
     Difficulty:  Unknown           |         Test Case:
     Blocked By:                    |          Blocking:
Related Tickets:                    |
------------------------------------+-------------------------------------
 I've changed the clone array primops (i.e. `cloneArray#`,
 `cloneMutableArray#`, `freezeArray#`, and `thawArray#`) to use the new
 inline allocation optimization for statically known array sizes.
 Furthermore, I've moved the implementation for the non-statically known
 case out-of-line, which should reduce code size.
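
 To illustrate the two cases, here is a minimal sketch of call sites where the size is a literal and where it is only known at run time. The `Arr` wrapper and the helper names are made up for this example; only the primops come from `GHC.Exts`. Whether the inline allocation path is actually taken depends on the compiler version and optimization level, so this shows only the shape of the calls.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts (Array#, Int (I#), cloneArray#, indexArray#, newArray#,
                 sizeofArray#, unsafeFreezeArray#)
import GHC.ST (ST (ST), runST)

-- Hypothetical boxed wrapper so the unlifted Array# can be used as an
-- ordinary lifted value.
data Arr a = Arr (Array# a)

-- Build an immutable array holding n copies of x.
replicateArr :: Int -> a -> Arr a
replicateArr (I# n) x = runST (ST (\s0 ->
  case newArray# n x s0 of
    (# s1, marr #) -> case unsafeFreezeArray# marr s1 of
      (# s2, arr #) -> (# s2, Arr arr #)))

-- The size (16#) is a literal at the call site: the statically known case.
cloneFirst16 :: Arr a -> Arr a
cloneFirst16 (Arr arr) = Arr (cloneArray# arr 0# 16#)

-- The size is only known at run time: the non-statically known case.
cloneFirstN :: Int -> Arr a -> Arr a
cloneFirstN (I# n) (Arr arr) = Arr (cloneArray# arr 0# n)

sizeArr :: Arr a -> Int
sizeArr (Arr arr) = I# (sizeofArray# arr)

indexArr :: Arr a -> Int -> a
indexArr (Arr arr) (I# i) = case indexArray# arr i of (# x #) -> x

main :: IO ()
main = do
  let src = replicateArr 32 'x'
      a   = cloneFirst16 src
      b   = cloneFirstN 8 src
  print (sizeArr a, sizeArr b)  -- prints (16,8)
  print (indexArr a 0)          -- prints 'x'
```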

 The numbers are very encouraging: total time drops by 74% (from 0.31s to
 0.08s), which makes the new implementation almost 4x faster than the old
 one. I measured this by looking at the total time reported by `+RTS -s`
 for the attached `InlineCloneArrayAlloc` benchmark.
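
 The attached benchmark is not reproduced in this message. As a rough idea of what such a measurement can look like, the sketch below clones a small mutable array in a loop with a literal size. The array size (16), the iteration count, and all names are assumptions made for this example, not values taken from `InlineCloneArrayAlloc`.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts (Int (I#), MutableArray#, RealWorld, cloneMutableArray#,
                 newArray#, sizeofMutableArray#)
import GHC.IO (IO (IO))

-- Hypothetical wrapper around a mutable boxed array of unit values.
data MArr = MArr (MutableArray# RealWorld ())

-- Clone the first 16 elements; the size is a literal at the call site.
cloneOnce :: MArr -> IO MArr
cloneOnce (MArr m) = IO (\s -> case cloneMutableArray# m 0# 16# s of
  (# s', m' #) -> (# s', MArr m' #))

-- Clone the source array `iters` times, summing the clone sizes so that
-- each clone is actually inspected.
run :: Int -> MArr -> IO Int
run iters src = go iters 0
  where
    go 0 acc = return acc
    go n acc = do
      MArr m <- cloneOnce src
      go (n - 1) $! acc + I# (sizeofMutableArray# m)

main :: IO ()
main = do
  src <- IO (\s -> case newArray# 16# () s of
    (# s', m #) -> (# s', MArr m #))
  total <- run 1000000 src
  print total  -- prints 16000000
```

 Assuming the file is saved as `Bench.hs`, compiling it with `ghc -O2 -rtsopts Bench.hs` and running `./Bench +RTS -s` prints a statistics summary in the same format as the ones shown in this ticket.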

 Here are the stats from the best out of three runs of the old
 implementation:

 {{{
    1,600,041,120 bytes allocated in the heap
            6,504 bytes copied during GC
           35,992 bytes maximum residency (1 sample(s))
           21,352 bytes maximum slop
             1588 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0         1 colls,     0 par    0.01s    0.01s     0.0082s    0.0082s
   Gen  1         1 colls,     0 par    0.00s    0.11s     0.1062s    0.1062s

   INIT    time    0.00s  (  0.00s elapsed)
   MUT     time    0.29s  (  0.57s elapsed)
   GC      time    0.01s  (  0.11s elapsed)
   EXIT    time    0.01s  (  0.11s elapsed)
   Total   time    0.31s  (  0.80s elapsed)

   %GC     time       2.7%  (14.2% elapsed)

   Alloc rate    5,497,251,856 bytes per MUT second

   Productivity  97.3% of total user, 37.4% of total elapsed
 }}}

 Here are the same stats for the new implementation:

 {{{
    1,600,041,120 bytes allocated in the heap
           57,224 bytes copied during GC
           35,992 bytes maximum residency (1 sample(s))
           21,352 bytes maximum slop
                1 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0      3125 colls,     0 par    0.01s    0.01s     0.0000s    0.0000s
   Gen  1         1 colls,     0 par    0.00s    0.00s     0.0003s    0.0003s

   INIT    time    0.00s  (  0.00s elapsed)
   MUT     time    0.08s  (  0.08s elapsed)
   GC      time    0.01s  (  0.01s elapsed)
   EXIT    time    0.00s  (  0.00s elapsed)
   Total   time    0.08s  (  0.09s elapsed)

   %GC     time       6.4%  (8.8% elapsed)

   Alloc rate    21,260,179,643 bytes per MUT second

   Productivity  93.5% of total user, 88.8% of total elapsed
 }}}
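
 For reference, the speedup can be recomputed from the two `Total time` lines above (0.31s for the old implementation, 0.08s for the new one). The small sketch below only does that arithmetic; the names are made up for this example. It gives a speedup of roughly 3.9x, which corresponds to a reduction in total time of roughly 74%.

```haskell
module Main (main) where

-- Total user time in seconds, copied from the two `+RTS -s` reports.
oldTotal, newTotal :: Double
oldTotal = 0.31
newTotal = 0.08

-- How many times faster the new implementation is.
speedup :: Double
speedup = oldTotal / newTotal

-- Fraction of the old total time that was saved.
timeReduction :: Double
timeReduction = (oldTotal - newTotal) / oldTotal

main :: IO ()
main = do
  putStrLn ("speedup:        " ++ show speedup)
  putStrLn ("time reduction: " ++ show timeReduction)
```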

 The performance gap widens further in favor of the new implementation as
 the iteration count is increased.

 There's also an interesting difference in the Gen 1 collection times
 between the two implementations: 0.11s elapsed for the old one versus
 0.00s for the new one.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8885
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler