In short, it performs exactly the same as the foldrWithKey version, as you pointed out (32MB allocated).
In both cases, using first-class monadic/applicative values seems to foil GHC.
And btw, these do show the scaling you would expect: on 2M elements they allocate 64MB, on 4M elements 128MB, and so on, whereas the traverseWithKey_ version allocates a constant amount.
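
For anyone following along, here is a minimal sketch of the kind of foldrWithKey-based definition being compared. It is not the actual benchmark code: the name `traverseWithKeyViaFoldr_`, the map contents, and the `IORef` counter are all made up for illustration. The point is only that each step builds an applicative action as an ordinary value and chains it onto the rest with `*>`.

```haskell
import qualified Data.Map.Strict as M
import Data.IORef (newIORef, modifyIORef', readIORef)

-- Hypothetical name, for illustration only: visit every key/value pair
-- for its effects and discard the results, by folding the actions
-- together with foldrWithKey.
traverseWithKeyViaFoldr_ :: Applicative f => (k -> a -> f b) -> M.Map k a -> f ()
traverseWithKeyViaFoldr_ f = M.foldrWithKey (\k v rest -> f k v *> rest) (pure ())

main :: IO ()
main = do
  -- Assumed test input: 1M Int keys mapped to themselves.
  let m = M.fromList [(i, i) | i <- [1 .. 1000000 :: Int]]
  total <- newIORef (0 :: Int)
  -- Each element contributes one IO action that bumps the counter.
  traverseWithKeyViaFoldr_ (\k v -> modifyIORef' total (+ (k + v))) m
  readIORef total >>= print
```

To look at allocation, compiling with `-O2 -rtsopts` and running with `+RTS -s` prints the total bytes allocated in the heap; changing the element count should show whether it scales with the map size. The exact figures will depend on the GHC and containers versions.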