I made a few modifications to your code and found that replacing `return (x3' + x4')` with `return $! x3' + x4'` brought maximum residency down to 64 KB. The `$!` forces the sum to be evaluated before it is returned, instead of returning an unevaluated thunk. You can see the progression here:
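As a side note on what `$!` changes, here is a minimal sketch using only `Control.Monad.ST` from base. The names `lazyAdd` and `strictAdd` are hypothetical helpers made up for illustration; they are not part of your code or of any library.

```haskell
import Control.Monad.ST (ST, runST)

-- Hypothetical helpers for illustration only.

-- Lazy version: `return` hands back a thunk for (x + y). The addition
-- only happens when some caller later demands the result.
lazyAdd :: ST s Int -> ST s Int -> ST s Int
lazyAdd mx my = do
  x <- mx
  y <- my
  return (x + y)

-- Strict version: `$!` has the lowest operator precedence, so this
-- parses as `return $! (x + y)`. The sum is evaluated to weak head
-- normal form (for an Int, that means fully) before it is returned.
strictAdd :: ST s Int -> ST s Int -> ST s Int
strictAdd mx my = do
  x <- mx
  y <- my
  return $! x + y
```

Both functions produce the same value; for example, `runST (strictAdd (return 1) (return 2))` is `3`. The only difference is when the addition is performed, which is what matters for residency when results are threaded through many nested calls.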
I took it one step further and used the mutable-containers package to get an unboxed reference instead of the boxed STRef type. In other words, the counter is stored unboxed, so updating it doesn't allocate a new boxed `Int` on the heap. Here's that version of the code:

    #!/usr/bin/env stack
    -- stack --resolver lts-7.14 --install-ghc exec --package mutable-containers -- ghc -O2 -with-rtsopts=-s
    import Control.Monad.ST (ST, runST)
    import Data.Mutable

    a :: Int
      -> ST s Int
      -> ST s Int
      -> ST s Int
      -> ST s Int
      -> ST s Int
      -> ST s Int
    a k x1 x2 x3 x4 x5 =
      do -- asURef pins the reference type to the unboxed URef
         kk <- fmap asURef $ newRef k
         -- b decrements the shared counter, then recurses with itself as x1
         let b = do k0 <- readRef kk
                    let k1 = k0 - 1
                    writeRef kk k1
                    a k1 b x1 x2 x3 x4
         -- $! forces each sum before it is returned
         if k <= 0 then do x3' <- x3; x4' <- x4; return $! x3' + x4'
                   else do x5' <- x5; b' <- b; return $! x5' + b'

    main = print (runST (a 22 (return 1) (return (-1)) (return (-1)) (return 1) (return 0)))

It knocked down total allocations from 2.7 GB to 1.8 GB, which is an improvement, but I think there's still some more low-hanging fruit here.
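If you want to try the boxed-versus-unboxed idea without the extra dependency, here is a minimal sketch that uses a one-element `STUArray` from the `array` package as a stand-in for `URef`. This is my substitution, not what mutable-containers does internally, and `newCell`, `countBoxed` and `countUnboxed` are hypothetical names. It assumes `array` is available (it ships with GHC, but a stack script would need `--package array`).

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray)
import Data.STRef (newSTRef, readSTRef, writeSTRef)

-- Hypothetical helper: a one-element unboxed array used as a mutable Int cell.
newCell :: Int -> ST s (STUArray s Int Int)
newCell = newArray (0, 0)

-- Boxed counter: the STRef holds a pointer to an Int value on the heap.
countBoxed :: Int -> Int
countBoxed n = runST $ do
  ref <- newSTRef n
  let loop = do
        k <- readSTRef ref
        if k <= 0
          then return k
          else do writeSTRef ref $! k - 1   -- store the evaluated value
                  loop
  loop

-- Unboxed counter: the Int is stored directly inside the array's
-- memory, so a write overwrites it in place.
countUnboxed :: Int -> Int
countUnboxed n = runST $ do
  cell <- newCell n
  let loop = do
        k <- readArray cell 0
        if k <= 0
          then return k
          else do writeArray cell 0 (k - 1)
                  loop
  loop
```

Both functions count down to zero and return the final value of the counter. Whether the unboxed version actually allocates less in a given program depends on what the optimizer already does with the boxed one, so it's worth comparing the two with `+RTS -s` rather than assuming.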