Faster Array#/MutableArray# copies

Hi,

Daniel Peebles has done a great job adding new primops for copying arrays (both Array# and MutableArray#). This gave me a nice speedup for my hash-array mapped trie (HAMT) data structure, which is now about as fast as Data.Map for inserts but about 5.5x faster for lookups. (It should also have a lower per-key/value overhead.) However, the HAMT is still more than twice as slow for inserts compared to IntMap, and copying arrays is still the bottleneck.

C compilers, like gcc, go to great lengths to make memcpy fast, and I was thinking that we might be able to steal a trick or two from them. I'd like some feedback on these ideas:

* Could we make the array primops inline? If so, would that mean that if e.g. the number of elements to copy is known at the (Haskell) call site, it would also be known in C--, and we could get the benefits of knowing that?

* Could we use built-in compiler rules to catch array copies of known length and replace them with e.g. unrolled loops? My particular use case involves copying small arrays (size: 1-32). Ideally this should be as fast as copying a tuple of the corresponding size, but I'm pretty sure we're far off that goal.

* Can we use SSE instructions?

* Can we get the C memcpy code inlined into the C-- source (crazy, I know)? If so we could perhaps benefit directly from optimizations in libc.

Johan
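For readers following along, the copy primops in question can be exercised directly from GHC.Exts. The following is a minimal sketch; the MArr wrapper and the helper names (newArr, writeArr, readArr, copyArr, demo) are made up here for illustration, not part of any library:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main where

import GHC.Exts
import GHC.IO (IO (..))

-- Lifted wrapper so the unlifted MutableArray# can be returned from IO.
data MArr a = MArr (MutableArray# RealWorld a)

-- Allocate a mutable array holding n copies of x.
newArr :: Int -> a -> IO (MArr a)
newArr (I# n#) x = IO $ \s ->
  case newArray# n# x s of (# s', m# #) -> (# s', MArr m# #)

writeArr :: MArr a -> Int -> a -> IO ()
writeArr (MArr m#) (I# i#) x = IO $ \s ->
  case writeArray# m# i# x s of s' -> (# s', () #)

readArr :: MArr a -> Int -> IO a
readArr (MArr m#) (I# i#) = IO $ \s -> readArray# m# i# s

-- The new primop: block-copy n elements between mutable arrays.
copyArr :: MArr a -> Int -> MArr a -> Int -> Int -> IO ()
copyArr (MArr src#) (I# srcOff#) (MArr dst#) (I# dstOff#) (I# n#) = IO $ \s ->
  case copyMutableArray# src# srcOff# dst# dstOff# n# s of s' -> (# s', () #)

demo :: IO [Int]
demo = do
  src <- newArr 8 0
  mapM_ (\i -> writeArr src i (i * 10)) [0 .. 7]
  dst <- newArr 8 (-1)
  copyArr src 2 dst 0 4   -- copy src[2..5] into dst[0..3]
  mapM (readArr dst) [0 .. 7]

main :: IO ()
main = demo >>= print   -- [20,30,40,50,-1,-1,-1,-1]
```

Each such copy currently compiles to an out-of-line loop (or memcpy call), which is what the proposals above aim to specialise when the length is statically known.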

On 18 February 2011 01:18, Johan Tibell wrote:
C compilers, like gcc, go to great lengths making memcpy fast and I was thinking that we might be able to steal a trick or two from them. I'd like some feedback on these ideas:
It seems like a sufficient solution for your needs would be for us to use the LTO support in LLVM to inline across module boundaries - in particular to inline primop implementations into their call sites. LLVM would then probably deal with unrolling small loops with statically known bounds. I don't think this would require a major change to GHC, though LTO would only work with the Gold linker (which only supports ELF) at the moment :-( Cheers, Max

Max Bolingbroke wrote:
On 18 February 2011 01:18, Johan Tibell wrote:
It seems like a sufficient solution for your needs would be for us to use the LTO support in LLVM to inline across module boundaries - in particular to inline primop implementations into their call sites. LLVM would then probably deal with unrolling small loops with statically known bounds.
Could we simply use this? http://llvm.org/docs/LangRef.html#int_memcpy Roman

On Fri, Feb 18, 2011 at 12:54 AM, Roman Leshchinskiy wrote:
Max Bolingbroke wrote:
On 18 February 2011 01:18, Johan Tibell wrote:
It seems like a sufficient solution for your needs would be for us to use the LTO support in LLVM to inline across module boundaries - in particular to inline primop implementations into their call sites. LLVM would then probably deal with unrolling small loops with statically known bounds.
Could we simply use this?
Might be easier to implement a PrimOp inlining pass, and to run it before LLVM's built-in MemCpyOptimization pass [0]. This wouldn't generally be as good as LTO but would work without gold. [0] http://llvm.org/doxygen/MemCpyOptimizer_8cpp_source.html

On 18/02/2011 19:42, Nathan Howell wrote:
On Fri, Feb 18, 2011 at 12:54 AM, Roman Leshchinskiy <rl@cse.unsw.edu.au> wrote:
Max Bolingbroke wrote:
On 18 February 2011 01:18, Johan Tibell <johan.tibell@gmail.com> wrote:
It seems like a sufficient solution for your needs would be for us to use the LTO support in LLVM to inline across module boundaries - in particular to inline primop implementations into their call sites. LLVM would then probably deal with unrolling small loops with statically known bounds.
Could we simply use this?
http://llvm.org/docs/LangRef.html#int_memcpy
Might be easier to implement a PrimOp inlining pass, and to run it before LLVM's built-in MemCpyOptimization pass [0]. This wouldn't generally be as good as LTO but would work without gold.
[0] http://llvm.org/doxygen/MemCpyOptimizer_8cpp_source.html
Ideally you'd want the heap check in the primop to be aggregated into the calling function's heap check, and the primop should allocate directly from the heap instead of calling out to the RTS allocate(). All this is a bit much to expect LLVM to do, but we could do it in the Glorious New Code Generator...

For small arrays like this maybe we should have a new array type that leaves out all the card-marking stuff too (or just use tuples, as Roman suggested).

Cheers, Simon

On Mon, Feb 28, 2011 at 9:01 AM, Simon Marlow wrote:
On 18/02/2011 19:42, Nathan Howell wrote:
On Fri, Feb 18, 2011 at 12:54 AM, Roman Leshchinskiy <rl@cse.unsw.edu.au> wrote:
Max Bolingbroke wrote:
On 18 February 2011 01:18, Johan Tibell <johan.tibell@gmail.com> wrote:
It seems like a sufficient solution for your needs would be for us to use the LTO support in LLVM to inline across module boundaries - in particular to inline primop implementations into their call sites. LLVM would then probably deal with unrolling small loops with statically known bounds.
Could we simply use this?
http://llvm.org/docs/LangRef.html#int_memcpy
Might be easier to implement a PrimOp inlining pass, and to run it before LLVM's built-in MemCpyOptimization pass [0]. This wouldn't generally be as good as LTO but would work without gold.
[0] http://llvm.org/doxygen/MemCpyOptimizer_8cpp_source.html
Ideally you'd want the heap check in the primop to be aggregated into the calling function's heap check, and the primop should allocate directly from the heap instead of calling out to the RTS allocate(). All this is a bit much to expect LLVM to do, but we could do it in the Glorious New Code Generator...
For small arrays like this maybe we should have a new array type that leaves out all the card-marking stuff too (or just use tuples, as Roman suggested).
I might try to use tuples directly. It would be very ugly though as I would need a sum type of 32 different tuple sizes. Johan

On Mon, Feb 28, 2011 at 6:29 PM, Johan Tibell wrote:
On Mon, Feb 28, 2011 at 9:01 AM, Simon Marlow wrote:
On 18/02/2011 19:42, Nathan Howell wrote:
On Fri, Feb 18, 2011 at 12:54 AM, Roman Leshchinskiy <rl@cse.unsw.edu.au> wrote:
Max Bolingbroke wrote:
On 18 February 2011 01:18, Johan Tibell <johan.tibell@gmail.com> wrote:
It seems like a sufficient solution for your needs would be for us to use the LTO support in LLVM to inline across module boundaries - in particular to inline primop implementations into their call sites. LLVM would then probably deal with unrolling small loops with statically known bounds.
Could we simply use this?
http://llvm.org/docs/LangRef.html#int_memcpy
Might be easier to implement a PrimOp inlining pass, and to run it before LLVM's built-in MemCpyOptimization pass [0]. This wouldn't generally be as good as LTO but would work without gold.
[0] http://llvm.org/doxygen/MemCpyOptimizer_8cpp_source.html
Ideally you'd want the heap check in the primop to be aggregated into the calling function's heap check, and the primop should allocate directly from the heap instead of calling out to the RTS allocate(). All this is a bit much to expect LLVM to do, but we could do it in the Glorious New Code Generator...
For small arrays like this maybe we should have a new array type that leaves out all the card-marking stuff too (or just use tuples, as Roman suggested).
I might try to use tuples directly. It would be very ugly though as I would need a sum type of 32 different tuple sizes.
You can't random-access the elements of a tuple based on an Int though, can you? You'd have to do a switch statement (pattern match) every time you wanted to access a specific element. Unless I'm missing something.
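Gábor's point can be made concrete with a truncated sketch of the tuples-as-small-arrays idea. Everything here (Small, index, size) is hypothetical and cut down to three sizes for brevity; the real thing would need 32 constructors:

```haskell
module Main where

-- Hypothetical sum of tuple sizes, truncated to 3 constructors.
data Small a
  = S1 a
  | S2 a a
  | S3 a a a

-- Indexing by a runtime Int needs a pattern-match case per
-- (constructor, position) pair - exactly the overhead noted above.
index :: Small a -> Int -> a
index (S1 x)     0 = x
index (S2 x _)   0 = x
index (S2 _ y)   1 = y
index (S3 x _ _) 0 = x
index (S3 _ y _) 1 = y
index (S3 _ _ z) 2 = z
index _          _ = error "index: out of range"

size :: Small a -> Int
size S1{} = 1
size S2{} = 2
size S3{} = 3

main :: IO ()
main = print (index (S3 'a' 'b' 'c') 1)   -- 'b'
```

The upside is that the elements are stored flat in one closure with no card table; the downside is the quadratic blow-up in match clauses as the maximum size grows.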
Johan
_______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
-- Work is punishment for failing to procrastinate effectively.

Simon Marlow wrote:
For small arrays like this maybe we should have a new array type that leaves out all the card-marking stuff too (or just use tuples, as Roman suggested).
Would it, in theory, be possible to have an "unpacked" array type? That is, could we have constructors for which the length of the closure is determined dynamically at runtime? Roman

On 01/03/2011 11:55, Roman Leshchinskiy wrote:
Simon Marlow wrote:
For small arrays like this maybe we should have a new array type that leaves out all the card-marking stuff too (or just use tuples, as Roman suggested).
Would it, in theory, be possible to have an "unpacked" array type? That is, could we have constructors for which the length of the closure is determined dynamically at runtime?
Certainly, but the amount of effort to implement depends on what you want to support. e.g. do you want to support {-# UNPACK #-} on primitive array types in a constructor field? That's probably quite hard. I believe Duncan Coutts has been thinking along similar lines, we talked about it once. Or were you thinking of something more restricted? Cheers, Simon

Simon Marlow wrote:
On 01/03/2011 11:55, Roman Leshchinskiy wrote:
Would it, in theory, be possible to have an "unpacked" array type? That is, could we have constructors for which the length of the closure is determined dynamically at runtime?
Certainly, but the amount of effort to implement depends on what you want to support. e.g. do you want to support {-# UNPACK #-} on primitive array types in a constructor field? That's probably quite hard. I believe Duncan Coutts has been thinking along similar lines, we talked about it once.
I can see that supporting this would be rather hard:

data T a = T {-# UNPACK #-} (Array# a)

We would have to allow Array# to point to the middle of a closure and it's far from obvious how to initialise this since we don't have Array# literals.
Or were you thinking of something more restricted?
Yes, I was thinking of some special syntax. Something along these lines:

data T a = T {a}

f x = T {x x x}
g (T {x y z}) = x
h (T xs) = xs{0}

I'm not seriously suggesting this syntax, this is just to demonstrate the general idea. In the last function, it shouldn't be possible to do anything with xs except indexing and taking the length. This would be much easier, right? Of course, we would also want this for byte arrays...

Roman

On Mon, Feb 28, 2011 at 9:01 AM, Simon Marlow wrote:
Ideally you'd want the heap check in the primop to be aggregated into the calling function's heap check, and the primop should allocate directly from the heap instead of calling out to the RTS allocate(). All this is a bit much to expect LLVM to do, but we could do it in the Glorious New Code Generator...
It's probably (certainly) better to do it in the code generator but it is rather easy to perform in LLVM and might be a worthwhile optimization during LTO/LTCG.

Johan Tibell wrote:
* Could we use built-in compiler rules to catch array copies of known length and replace them with e.g. unrolled loops? My particular use case involves copying small arrays (size: 1-32). Ideally this should be as fast as copying a tuple of the corresponding size but I'm pretty sure we're far off that goal.
Out of idle curiosity, couldn't you use tuples instead of arrays? FWIW, I agree that doing something cleverer than just calling memcpy could be very worthwhile. As Max points out, you could perhaps try to do something with the LLVM backend. Roman

On February 17, 2011 20:18:11 Johan Tibell wrote:
* Can we use SSE instructions?
* Can we get the C memcpy code inlined into the C-- source (crazy, I know). If so we could perhaps benefit directly from optimizations in libc.
From the standpoint of numerical code, it would be very nice to get some vector support. Perhaps this would also give a natural expression of memcpy.

In the spirit of the common "infinite" register file assumption, I've always imagined idealized arbitrary-length vector types/operations at the C-- level (mapped to reality via chunking over fixed-length vector instructions). I see that LLVM supports arbitrary-length vectors as a first-class type:

http://llvm.org/docs/LangRef.html#t_vector
http://llvm.org/docs/LangRef.html#i_add

Looking at the C-- specification (section 2.4 -- page 10)

http://www.cminusminus.org/extern/man2.pdf

it seems this may not be such a good fit, as it considers everything as fixed-size bit collections (bits8, bits16, etc.) and booleans (bool). Presumably memory-oriented primitive instructions are also out, as they break the strict load/store architecture.

Has anyone thought of how to do this? Some half-baked suggestions:

- perhaps it should be fixed-size bit collections with repetition, or
- built-in primitives for instruction composition (i.e., a map operation)?

Cheers! -Tyson
participants (7)
- Gábor Lehel
- Johan Tibell
- Max Bolingbroke
- Nathan Howell
- Roman Leshchinskiy
- Simon Marlow
- Tyson Whitehead