
Stefan O'Rear wrote:
2. Parameters are very expensive. Our type of functions that build (ignoring CPS for the time being) was MBA# -> Int# -> [ByteString], where the Int# is the current write pointer. Adding an extra Int# to cache the size of the array (rather than calling sMBA# each time) slowed the code down ~2x. Conversely, moving the write pointer into the byte array (storing it in bytes 0#, 1#, 2#, and 3#) sped the code by 4x.
If you were measuring on x86 then parameters are passed on the stack, which may be expensive. On x86_64 the first 3 arguments are passed in registers, which is usually a win, but if the function immediately does an eval they need to be saved on the stack anyway. Still, 4x sounds like a lot, perhaps you managed to avoid a stack check in the inner loop or something.
just out of curiosity: what does GHC do with stack parameters in loops/tail calls? there tends to be a conflict as the old set of parameters is needed to build the new one for the recursive call/next loop iteration, while one wants to get rid of the old set before doing the call. unless one keeps the parameter frames away from the stack, or can figure out a safe order in which to overwrite the parameters in the frame, that seems to imply some copying, even though that may be cheap for few/small parameters per frame. many years ago, i saw an abstract machine and RISC processor design aiming for good fp support that used two parameter stacks instead of one for just this reason. instead of moving stack frames around on a single stack, parameters were read from one stack, and built up on the other, followed by a cheap stack switch before the call. perhaps something like this might be of use here, too? claus