Firstly, I think you mean Bryan's code, right? Secondly, I wrote this email five minutes before going to bed, so please excuse any inaccuracies or incoherence in it.
(3) is different to the original because it is form as `await >>= maybe`. There's a rewrite rule for this (providing the conduit 1.1 version):
{-#
RULES "await >>= maybe" forall x y. await >>= maybe x y =
ConduitM (NeedInput (unConduitM . y) (unConduitM . const x)) #-}
This gives two advantages: it avoids a costly (at least in conduit 1.1) monadic bind, and avoids creating a Maybe value that we're going to immediately throw away. This is the general idea of church encoding data structures, which is why it should give *some* speedup, but nothing significant.
Regarding fold: it *is* identical to (3). I don't think the rewrite rules are playing a big part here. You're right about needing to be careful about fold due to WHNF. However, Bryan's code already used bang patterns to force evaluation of both values, so that isn't a problem. In a real-world application, I'd *also* recommend creating a strict pair datatype to avoid accidentally leaking thunks.
But looking at the code again with fresher eyes than last night: I really don't understand why it had such abysmal performance. I'll look into this a bit more, looks like it should be interesting.