[GHC] #10652: Better cache performance in Array#

#10652: Better cache performance in Array#
-------------------------------------+-------------------------------------
        Reporter:  MikeIzbicki       |             Owner:
            Type:  feature request   |            Status:  new
        Priority:  normal            |         Milestone:
       Component:  Compiler          |           Version:  7.10.1
        Keywords:                    |  Operating System:  Unknown/Multiple
    Architecture:  Unknown/Multiple  |   Type of failure:  None/Unknown
       Test Case:                    |        Blocked By:
        Blocking:                    |   Related Tickets:
                                     |  Differential Revisions:
-------------------------------------+-------------------------------------
 A common use case for arrays is to loop over them, performing an
 operation on each element. For `ByteArray#`s, this is guaranteed to have
 good cache performance: when element `i` is accessed, the CPU's
 prefetcher will ensure that element `i+1` is already sitting in cache, so
 the next iteration will not suffer a cache miss. Currently, the `Array#`
 and `SmallArray#` types do not significantly benefit from the prefetcher.
 This feature request asks for a few new primops that would improve the
 situation.

 My understanding is that the "raw" element in an `Array#` is actually a
 pointer to the region of memory holding the "logical" element of the
 array. When looping, the subsequent pointers are guaranteed to get
 prefetched, but the subsequent logical elements are not. In particular,
 we only get the benefits of prefetching on the logical elements if we're
 lucky enough that the memory manager happened to allocate those elements
 next to each other on the heap. I don't know enough about GHC internals
 to check the source code, but my experiments demonstrate that this is not
 currently happening in my use case, resulting in rather significant
 performance degradation. (The experiments are rather complicated, so I'm
 not attaching them.)
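To make the boxed/unboxed contrast concrete, here is a small sketch (mine, not from the ticket) of the same summation loop over a boxed `Array` and an unboxed `UArray` from the `array` package. The `UArray` stores its `Int`s inline, so a linear scan touches consecutive memory and the prefetcher can keep up; the boxed `Array` stores pointers, and each pointed-to element may live anywhere on the heap.

```haskell
-- Sketch: identical traversal code, different memory layouts.
import Data.Array (Array, bounds, (!))
import qualified Data.Array.Unboxed as U

-- Boxed: each (a ! i) follows a pointer to a heap-allocated Int closure.
sumBoxed :: Array Int Int -> Int
sumBoxed a = go lo 0
  where
    (lo, hi) = bounds a
    go i acc
      | i > hi    = acc
      | otherwise = go (i + 1) (acc + a ! i)

-- Unboxed: the Ints sit contiguously in one flat buffer.
sumUnboxed :: U.UArray Int Int -> Int
sumUnboxed a = go lo 0
  where
    (lo, hi) = U.bounds a
    go i acc
      | i > hi    = acc
      | otherwise = go (i + 1) (acc + a U.! i)
```

The two functions compute the same result; only the cache behavior of the traversal differs.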
 I propose adding the primops:

 {{{
 packArray#             :: Array# a -> b -> b
 packSmallArray#        :: SmallArray# a -> b -> b
 packMutableArray#      :: MutableArray# s a -> State# s -> State# s
 packSmallMutableArray# :: SmallMutableArray# s a -> State# s -> State# s
 }}}

 These operations would have the semantic effect of no-ops, except that
 they request that the logical elements of the arrays be arranged
 adjacently in memory. Future loops over the arrays would then benefit
 from CPU prefetching. There are a number of ways the memory manager could
 handle these requests, and I don't particularly care which is chosen. For
 example, it could rearrange the memory immediately, or wait until the
 next garbage collection pass and do it then.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
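To illustrate how the proposal might be used: since `packArray#` does not exist, the sketch below (mine, not from the ticket) stands in for it with a wrapper that is `id` today — exactly the no-op semantics the proposal asks for, minus the memory-layout hint.

```haskell
-- Hypothetical usage sketch: 'packed' marks where a call to the proposed
-- packArray# would go. It is implemented as 'id' because the primop is
-- only proposed; a real implementation would additionally ask the RTS to
-- lay out the array's elements contiguously.
import Data.Array (Array, listArray, elems)

packed :: Array Int a -> Array Int a
packed = id  -- placeholder for the proposed packArray#

-- Pack once up front, then traverse the (hopefully contiguous) elements.
sumPacked :: Array Int Int -> Int
sumPacked = sum . elems . packed
```

The point of the design is that `packed` is semantically invisible: code using it is correct whether or not the layout request is honored.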

#10652: Better cache performance in Array#

Comment (by simonpj):

 Getting better cache behaviour would be a good thing! I don't know the
 implementation in detail either, but I'm pretty sure that all you need to
 do is use unboxed arrays. Certainly, before even beginning to think about
 new primops, we need to understand ''in detail'' what the current
 implementation does and why it has poor cache behaviour.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:1>

#10652: Better cache performance in Array#

Comment (by MikeIzbicki):

 In my specific use case, unboxed arrays are not possible because each
 element does not have a fixed size (think of something like `Integer`).

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:2>

#10652: Better cache performance in Array#

Comment (by carter):

 Mike: have you looked at how the GC arranges values only reachable via a
 boxed array?

 I guess one problem I can think of offhand is that your primops get
 tricky when a value is shared across multiple boxed arrays!

 A valid and interesting idea would be to just make sure that the GC
 places unshared values reachable only via a boxed array contiguously when
 doing a copying GC. I should try to look into this myself.

 Relatedly: Bob Harper et al. did prove a few years ago that certain GC
 strategies result in cache-optimal memory layouts. But I don't know if
 GHC does a traversal order that maps to their specific proof.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:3>

#10652: Better cache performance in Array#

Comment (by MikeIzbicki):

 Replying to [comment:3 carter]:

 > Mike: have you looked at how the gc arranges values only reachable via
 > a boxed array?

 I wouldn't even know where to look for this. My only evidence that things
 aren't aligned is the huge number of cache misses I'm getting. But I
 suppose these cache misses could be caused by something else.

 > I guess one problem I can think of off hand is that your primops get
 > tricky when a value is shared across multiple boxed arrays!

 My proposal doesn't run into this problem. I'm specifically not proposing
 that GHC try to find a "globally optimal" memory allocation scheme.
 Instead, all I want is that memory is aligned immediately after these
 operations are called. A subsequent call to these primops on another
 array with the same elements in a different order would destroy that
 alignment.

 > A valid and interesting idea would be to just make sure that the gc
 > places unshared values reachable only via a boxed array continuously
 > when doing a copying gc. I should try to look into this myself.

 I think this would work for me, and probably be better in general
 performance-wise. I was just (maybe naively) imagining this would be
 harder to implement.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:4>

#10652: Better cache performance in Array#

Comment (by carter):

 I actually think it'd be easier, and safer with respect to space usage,
 to provide the latter (i.e., preserve sharing!). Your proposed operations
 require doing a GC-style movement of the values held in the underlying
 array in order to have the right semantics.

 I'm not sure where to look in the GC myself, but I think
 https://github.com/ghc/ghc/blob/master/rts/sm/Evac.c#L711-L725 is
 probably a good starting point.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:5>

#10652: Better cache performance in Array#

Comment (by carter):

 Roughly: as long as the GC does a *breadth-first* copy of the values
 reachable through a boxed array, in left-to-right order, you should get
 the behavior you want. I *believe* GHC roughly does this in a depth-first
 order, but I could be wrong.

 Likewise,
 https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/GC/Copying
 provides some more information about the rough flavor of the copying
 algorithms.

 One challenge might be that the underlying storage layer of the GC is
 block oriented, and I believe the GC can promote a block of memory
 between generations instead of just copying it.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:6>
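The breadth-first versus depth-first point can be illustrated with a toy model (my own sketch, not GHC's actual evacuation code): a rose tree stands in for the object graph, where the root plays the role of a boxed array and its children the array's logical elements. Breadth-first copy order places the direct children next to each other in to-space before any of their payloads; depth-first order interleaves each child with its own descendants.

```haskell
-- Toy model of GC copy order over a small object graph.
data Obj = Obj String [Obj]

-- Depth-first copy order: each child is immediately followed by its
-- descendants, so sibling elements end up separated.
dfsOrder :: Obj -> [String]
dfsOrder (Obj name kids) = name : concatMap dfsOrder kids

-- Breadth-first copy order (Cheney-style queue): all children land
-- adjacent to one another before any grandchild is copied.
bfsOrder :: Obj -> [String]
bfsOrder root = go [root]
  where
    go []                  = []
    go (Obj name kids : q) = name : go (q ++ kids)

-- "array" with two elements, each holding a payload object.
example :: Obj
example = Obj "array" [ Obj "e0" [Obj "payload0" []]
                      , Obj "e1" [Obj "payload1" []] ]
-- bfsOrder example → ["array","e0","e1","payload0","payload1"]
-- dfsOrder example → ["array","e0","payload0","e1","payload1"]
```

In the BFS ordering, `e0` and `e1` are contiguous, which is the layout a left-to-right array scan wants.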

#10652: Better cache performance in Array#

Comment (by rwbarton):

 If the values pointed to by your array don't themselves contain pointers
 (and assuming nothing else on the heap points to them), then either a BFS
 or a DFS traversal will visit them all sequentially. But in many cases,
 if your values don't contain pointers, then you could find a way to store
 them in an unboxed array with higher memory density anyway. (Using a
 boxed array costs you two words per element up front: one for the pointer
 in the array and one for the info pointer of the value.)

 If your values do themselves contain pointers, then it does matter
 whether the traversal is BFS or DFS, but which one you want depends on
 how you plan to use the values, which the GC can't know without further
 hints. For example, a large Integer contains a pointer to a ByteArray#
 containing the actual integer data, which you will need for most (but not
 all) operations on the Integer. If you have an array with a lot of large
 Integers, you may or may not want to put the ByteArray# after its
 corresponding Integer (even ignoring the possibility of the ByteArray#s
 being shared).

 My potential concerns about the original feature request are:

 - It sounds like a lot of extra complexity in the GC.
 - Existing programs gain nothing from this added complexity unless they
   are modified to use the new primops.
 - Consequently, the implementation of the new primops must have zero cost
   for programs that do not use them.
 - A fairly narrow range of programs stand to benefit from the new
   primops.
 - It's unclear how much there is to be gained in the best case,
   especially compared to user-space alternatives like explicitly
   prefetching the next value of the array before processing the current
   one.

 You might be better off just implementing your own memory management
 (e.g. allocate variable-length representations of your data sequentially
 from a mutable ByteArray, then store offsets into this storage in your
 array and access the data through short-lived Haskell values,
 Storable-style).

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:7>
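The do-it-yourself storage idea above can be sketched roughly as follows (my own illustration; the names and the use of immutable unboxed arrays instead of a mutable ByteArray are simplifications): variable-length values are flattened into one unboxed byte buffer, a second unboxed array records where each value starts, and element `i` is read back as a short-lived slice.

```haskell
-- Sketch of user-managed flat storage for variable-length values.
import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Word (Word8)

data FlatStore = FlatStore
  { bytes   :: UArray Int Word8  -- all payloads, back to back
  , offsets :: UArray Int Int    -- n+1 entries; offsets ! i is where value i starts
  }

-- Build a store from variable-length byte strings. Because the payloads
-- are packed contiguously, a left-to-right scan is prefetcher-friendly.
fromChunks :: [[Word8]] -> FlatStore
fromChunks chunks = FlatStore
  { bytes   = listArray (0, total - 1) (concat chunks)
  , offsets = listArray (0, n) (scanl (+) 0 (map length chunks))
  }
  where
    n     = length chunks
    total = sum (map length chunks)

-- Read back value i as a short-lived list of bytes (the slice
-- [offsets!i, offsets!(i+1)) of the flat buffer).
index :: FlatStore -> Int -> [Word8]
index s i = [ bytes s ! j | j <- [offsets s ! i .. offsets s ! (i + 1) - 1] ]
```

A real implementation would decode the slices into proper Haskell values (e.g. rebuild an `Integer` from its limbs) rather than returning raw bytes, but the layout idea is the same.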

#10652: Better cache performance in Array#

Changes (by thomie):

 * failure: None/Unknown => Runtime performance bug

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:8>

#10652: Better cache performance in Array#

Changes (by lelf):

 * cc: lelf (added)

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:9>

#10652: Better cache performance in Array#

Comment (by thomie):

 See this [https://izbicki.me/blog/fast-nearest-neighbor-queries-in-
 haskell.html blog post] for a use case for this feature, towards the end
 of "Lesson 4: Haskell's standard libraries were not designed for fast
 numeric computing."

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:10>

#10652: Better cache performance in Array#

Changes (by bgamari):

 * cc: bgamari (added)

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:11>

#10652: Better cache performance in Array#

Comment (by rwbarton):

 So to simplify, the use case is something like

 {{{
 data IntArrayTree = IntArrayTree
   { value    :: {-# UNPACK #-} !Int
   , children :: {-# UNPACK #-} !(Array Int IntArrayTree)
   }
 }}}

 and arguably the underlying problem that would be nice to fix is that
 there are two levels of indirection (`IntArrayTree` -> `Array#` ->
 `IntArrayTree`) per level of the tree.

 Of note is that the `IntArrayTree` values themselves are actually of
 constant size. But we can't store them efficiently in any sort of array
 because they contain a mix of pointer and non-pointer fields.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:12>

#10652: Better cache performance in Array#

Comment (by tibbe):

 Getting rid of the kind of array indirection rwbarton mentioned would
 greatly help unordered-containers.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:13>

#10652: Better cache performance in Array#

Comment (by rwbarton):

 One idea I had was to create a new family of array types alongside boxed
 and unboxed arrays, called unpacked arrays. (Note this is the opposite of
 UnpackingArrays.) The idea is that whenever `T` is a type that can be
 unpacked with the `{-# UNPACK #-}` pragma, we can form values of type
 `UnpackedArray# T`. The heap representation of such a value is of the
 form

 {{{
 UnpackedArray#_info T_con_info n a1 b1 c1 a2 b2 c2 ... an bn cn
 }}}

 representing an array containing the values

 {{{
 T_con_info a1 b1 c1
 T_con_info a2 b2 c2
 ...
 T_con_info an bn cn
 }}}

 (Since `T` can be unpacked, it has just one constructor.) Compared to an
 `Array#`, this costs one extra word for `T_con_info`. It might be
 possible to save this word with a new type of info table layout; then the
 heap representation of `Array# a` would be identical (modulo the info
 pointer value) to that of `UnpackedArray# (Box a)` where `data Box a =
 Box a`.

 The GC can use `T_con_info` to understand the layout of the rest of the
 heap object, while mutator operations like indexing would use operations
 from a magically-generated `Unpack` type class. This last part might be
 tricky.

 Then `IntArrayTree` could hold an `UnpackedArray# IntArrayTree` to avoid
 one of the layers of indirection. It won't apply to tibbe's
 unordered-containers though, since `IntMap` has multiple constructors. (I
 guess it could be used inside `Collision !Hash !(A.Array (Leaf k v))`,
 but that's rather pointless.)

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:14>

#10652: Better cache performance in Array#

Comment (by rwbarton):

 MikeIzbicki, you will probably be interested in Phab:D1264.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:15>

#10652: Better cache performance in Array#

Comment (by carter):

 This seems to overlap with the array tech that Edward Kmett has demoed
 for mutable data structures, and the resulting conversations that ensued
 on the mailing lists and at ICFP 2015.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10652#comment:16>