After playing a bit with Ryan's benchmark, I no longer think that the
order matters much for the total number of allocations, nor that first-class
vs. non-first-class IO actions make a difference. All that should matter is
how many allocations we can move to the stack. But I haven't yet figured
out why exactly different versions differ so drastically in this regard.
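
For what it's worth, the probe I've been using to see where allocations land is GHC's per-thread allocation counter.  A minimal sketch (assuming GHC.Conc.getAllocationCounter, i.e. base >= 4.8 / GHC 7.10 -- this is not Ryan's actual harness):

    import Control.Exception (evaluate)
    import Data.Int          (Int64)
    import GHC.Conc          (getAllocationCounter)
    import qualified Data.Map.Strict as M

    -- Bytes allocated while running an action; GHC's per-thread counter
    -- counts *down*, hence (before - after).
    allocsOf :: IO a -> IO Int64
    allocsOf act = do
      before <- getAllocationCounter
      _      <- act
      after  <- getAllocationCounter
      return (before - after)

    main :: IO ()
    main = do
      let m = M.fromList [ (i, i) | i <- [1 .. 100000 :: Int] ]
      _ <- evaluate m    -- force the map before we start counting
      n <- allocsOf (evaluate (M.foldrWithKey (\_ v acc -> v + acc) 0 m))
      putStrLn ("bytes allocated by the fold: " ++ show n)

The counter is per-thread, so this only makes sense for single-threaded runs, but it's handy for isolating one fold at a time.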

Yeah, it's all rather difficult to predict in advance, isn't it?

I tried your alternate foldrWithKey and saw it heap-allocating as well.

Further, -O0 vs. -O2 can make a big difference.  That's a little frustrating when you're trying to deal efficiently with big data sets, especially in parallel.  It would be nice to have big-O numbers in the docs for heap allocation as well as time cost -- and ones you could trust irrespective of optimization level.
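
To make that concrete, the kind of comparison I have in mind is the little program below (SumMap.hs is just a placeholder name, and -rtsopts is there so the binary accepts +RTS -s).  Compile it once with -O0 and once with -O2, run each with +RTS -s, and compare the "bytes allocated in the heap" lines.

    -- ghc -O0 -rtsopts SumMap.hs && ./SumMap +RTS -s
    -- ghc -O2 -rtsopts SumMap.hs && ./SumMap +RTS -s
    import qualified Data.Map.Strict as M

    main :: IO ()
    main = do
      let m = M.fromList [ (i, i) | i <- [1 .. 1000000 :: Int] ]
      -- A strict left fold; at -O0 the Int accumulator typically stays
      -- boxed on the heap, while -O2 usually unboxes it.
      print (M.foldlWithKey' (\acc _k v -> acc + v) 0 m)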

By the way, is traverse/traverseWithKey supposed to guarantee a specific order?  The doc uses this code in the definition:

    traverse (\(k, v) -> (,) k <$> f k v) (toList m)

And I thought "toList" didn't guarantee anything (as opposed to toAscList)...
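
One way to at least observe what the current implementation does (an experiment, not a documented guarantee) is to traverse with a printing action:

    import qualified Data.Map as M

    main :: IO ()
    main = do
      let m = M.fromList [(3, "c"), (1, "a"), (2, "b")]
      -- If traverseWithKey visits keys in ascending order, this prints
      -- (1,"a"), (2,"b"), (3,"c") regardless of insertion order.
      _ <- M.traverseWithKey (\k v -> print (k, v) >> return v) m
      return ()

As far as I can tell toList is currently implemented as toAscList, but that's exactly the question -- whether the docs mean to promise it.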