foldMapWithKey :: Monoid r => (k -> a -> r) -> M.Map k a -> r
    foldMapWithKey f = getConst . M.traverseWithKey (\k x -> Const (f k x))

Shachaf, thanks for fleshing it out.  I updated the test with your version as well:

   https://gist.github.com/rrnewton/5912908 

In short, it performs exactly the same as the foldrWithKey version, as you pointed out (32M allocation).  

In both cases, using first class monadic/applicative values seems to foil GHC.

And btw, these do show the scaling you would expect, on 2M elements, it allocates 64MB, 4M -> 128MB, and so on, whereas the traverseWithKey_ version allocates a constant amount.

  -Ryan