
Hi, Daniel Peebles has done a great job adding new primops for copying arrays (both Array# and MutableArray#). This gave me a nice speedup for my hash-array mapped trie (HAMT) data structure, which is now about as fast as Data.Map for inserts but about 5.5x faster for lookups. (It should also have a lower per key/value overhead.) However, for the HAMT is more than twice as slow for inserts compared to IntMap and copying arrays is still the bottleneck. C compilers, like gcc, go to great lengths making memcpy fast and I was thinking that we might be able to steal a trick or two from them. I'd like some feedback on these ideas: * Could we try to make the array primops inline? If so, would that mean that if e.g. the number of elements to copy is known at the (Haskell) call site it would also be known in C-- and we could get the benefits of knowing that? * Could we use built-in compiler rules to catch array copies of known length and replace them with e.g. unrolled loops? My particular use case involves copying small arrays (size: 1-32). Ideally this should be as fast as copying a tuple of the corresponding size but I'm pretty sure we're far off that goal. * Can we use SSE instructions? * Can we get the C memcpy code inlined into the C-- source (crazy, I know). If so we could perhaps benefit directly from optimizations in libc. Johan