
On 18/04/2010 10:28, Denys Rtveliashvili wrote:
While alloca is not as cheap as, say, C's alloca, you should find that it is much quicker than C's malloc. I'm sure there's room for optimisation if it's critical for you. There may well be low-hanging fruit: take a look at the Core for alloca.

Thank you, Simon.
Indeed, there is low-hanging fruit.
The type of "alloca" is "Storable a => (Ptr a -> IO b) -> IO b", and it is not inlined even though the function is small. Calls to functions with such a signature are expensive (I suppose that is because of the look-up in the typeclass dictionary). However, when I added an "INLINE" pragma for the function in Foreign.Marshal.Alloc, the execution time dropped from 40 to 20 nanoseconds. I guess the same effect will occur if other similar functions are marked with "INLINE".
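For illustration, here is a minimal sketch of the kind of change described above: a small overloaded wrapper marked with an INLINE pragma. The name allocaInlined and the Main module are hypothetical, and this is not the actual source of Foreign.Marshal.Alloc; it only assumes that allocaBytesAligned is exported from that module, as it is in recent versions of base.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module Main (main) where

import Foreign.Marshal.Alloc (allocaBytesAligned)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable (alignment, peek, poke, sizeOf))

-- Hypothetical stand-in for alloca: a thin overloaded wrapper that only
-- looks up the size and alignment in the Storable dictionary and then
-- delegates to the non-overloaded allocaBytesAligned.
{-# INLINE allocaInlined #-}
allocaInlined :: forall a b. Storable a => (Ptr a -> IO b) -> IO b
allocaInlined =
  allocaBytesAligned (sizeOf (undefined :: a)) (alignment (undefined :: a))

main :: IO ()
main = do
  -- At this call site the element type is known to be Int, so once the
  -- wrapper is inlined GHC can resolve sizeOf and alignment statically.
  r <- allocaInlined $ \p -> do
    poke p (42 :: Int)
    peek p
  print r
```

Whether this yields the same speed-up as reported above would have to be measured; the sketch only shows where the pragma goes.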
Is there a reason why we do not want small FFI-related functions with typeclass arguments to be marked with an "INLINE" pragma to gain a performance improvement? The only reason that comes to my mind is code size, but the resulting code actually looks very small and neat.
Adding an INLINE pragma is the right thing for alloca and similar functions. alloca is a small overloaded wrapper around allocaBytesAligned, and without the INLINE pragma the body of allocaBytesAligned gets inlined into alloca itself, making it too big to be inlined at the call site (you can work around it with e.g. -funfolding-use-threshold=100). This is really a case of manual worker/wrapper: we want to tell GHC that alloca is a wrapper, and the way to do that is with INLINE. Ideally GHC would manage this itself - there's a lot of scope for doing some general code splitting, but I don't think anyone has explored that yet.

Cheers,
Simon
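As a rough sketch of the workaround mentioned above, the unfolding threshold can be raised for a single calling module with an OPTIONS_GHC pragma. The module and the function roundTrip are hypothetical, the value 100 is simply the one quoted above, and the flag only matters when the module is compiled with optimisation (e.g. -O or -O2).

```haskell
{-# OPTIONS_GHC -funfolding-use-threshold=100 #-}
module Main (main) where

import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (peek, poke)

-- Hypothetical call site: with the higher threshold GHC is more willing
-- to inline the library's alloca here, even without an INLINE pragma
-- on alloca itself.
roundTrip :: Int -> IO Int
roundTrip x = alloca $ \p -> do
  poke p x
  peek p

main :: IO ()
main = roundTrip 7 >>= print
```

Comparing the output of -ddump-simpl with and without the pragma is one way to check whether the call was actually inlined.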