Interesting, thanks for the link. However we're trying to do _even less_ than that - this is just the "scan the document to produce the Elias-Fano encoding" step, except without needing to keep a hold of it for later. It's not quite as trivial as that paper makes out as (a) it doesn't mention the possibility that the documents might not be well-formed, and (b) it doesn't really talk about dealing with the special delimiting characters within string literals, for which you need to drag along a bunch of state while you're scanning.
This'd be pretty trivial to do in a C-like language now we've got the DFA tables built, and I may resort to that at some point if we can't work out how to avoid the allocations in Haskell-land, but I'd quite like to be able to solve problems of this form without unnecessarily resorting to C.
Cheers,