
#10652: Better cache performance in Array#
-------------------------------------+-------------------------------------
        Reporter:  MikeIzbicki       |              Owner:
            Type:  feature request   |             Status:  new
        Priority:  normal            |          Milestone:
       Component:  Compiler          |            Version:  7.10.1
        Keywords:                    |   Operating System:  Unknown/Multiple
    Architecture:  Unknown/Multiple  |    Type of failure:  None/Unknown
       Test Case:                    |         Blocked By:
        Blocking:                    |    Related Tickets:
                                     |  Differential Revisions:
-------------------------------------+-------------------------------------
A common use case for arrays is to loop over them, performing an operation
on each element. For `ByteArray#`s this is guaranteed to have good cache
performance: when element `i` is accessed, the CPU's prefetcher ensures that
element `i+1` is already sitting in cache, so the next loop iteration does
not suffer a cache miss. Currently, the `Array#` and `SmallArray#` types do
not benefit significantly from the prefetcher. This feature request asks for
a few new primops that would improve the situation.

My understanding is that the "raw" element stored in an `Array#` is actually
a pointer to the region of memory holding the "logical" element of the
array. When looping, the subsequent pointers are guaranteed to get
prefetched, but the logical elements they point to are not. In particular,
we only get the benefits of prefetching on the logical elements if we are
lucky enough that the memory manager happened to allocate them next to each
other on the heap. I don't know enough about GHC internals to check the
source code, but my experiments demonstrate that this is not currently
happening in my use case, resulting in rather significant performance
degradations. (The experiments are rather complicated, so I'm not attaching
them.)

I propose adding the primops:

{{{
packArray# :: Array# a -> b -> b
packSmallArray# :: SmallArray# a -> b -> b
packMutableArray# :: MutableArray# s a -> State# s -> State# s
packSmallMutableArray# :: SmallMutableArray# s a -> State# s -> State# s
}}}

These operations would semantically be no-ops, except that they request that
the logical elements of the array be arranged adjacently in memory. Future
loops over the arrays would then benefit from CPU prefetching. There are a
number of ways the memory manager could handle these requests, and I don't
particularly care which is chosen. For example, it could rearrange the
memory immediately, or wait until the next garbage collection pass and do it
then.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10652
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
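
As an illustration, here is a minimal sketch of how the proposed primops
might be used. `packArray#` does not exist in today's `GHC.Prim`, so a no-op
stand-in with the proposed type is defined locally; all other primops
(`indexArray#`, `sizeofArray#`, etc.) already exist in `GHC.Exts`.

{{{
{-# LANGUAGE MagicHash, UnboxedTuples #-}
-- Sketch only: packArray# is the *proposed* primop from this ticket.
-- A no-op stand-in with the proposed type is defined here so the example
-- compiles; a real implementation would live in GHC.Prim.
module PackExample where

import GHC.Exts

-- No-op stand-in. The real primop would ask the memory manager to lay the
-- array's logical elements out contiguously on the heap.
packArray# :: Array# a -> b -> b
packArray# _ b = b

-- Sum the elements of a boxed array. Packing before the loop would mean
-- that successive indexArray# reads touch adjacent heap objects, so the
-- CPU prefetcher can hide the latency of each element access.
sumArray :: Array# Int -> Int
sumArray arr = packArray# arr (go 0# 0#)
  where
    go :: Int# -> Int# -> Int
    go i acc
      | isTrue# (i >=# sizeofArray# arr) = I# acc
      | otherwise = case indexArray# arr i of
          (# I# x #) -> go (i +# 1#) (acc +# x)
}}}

Because `packArray#` returns its second argument unchanged, the call can be
wrapped around any loop without changing its result; only the layout request
(and hence cache behaviour) would differ.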