
On 2017-05-11 18:59, David Turner wrote:
Interesting, thanks for the link. However, we're trying to do _even less_ than that - this is just the "scan the document to produce the Elias-Fano encoding" step, except without needing to keep hold of the encoding for later. It's not quite as trivial as that paper makes out, because (a) it doesn't mention the possibility that the documents might not be well-formed, and (b) it doesn't really talk about dealing with the special delimiting characters within string literals, for which you need to drag along a bunch of state while you're scanning.
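To make the "drag along a bunch of state" point concrete, here is a minimal sketch of such a scan as a strict left fold. It is not the DFA discussed in this thread: the names (`Mode`, `Scan`, `step`, `balanced`) are hypothetical, and it assumes JSON-like syntax with `"` as the string delimiter and `\` as the escape character. It only counts nesting depth, so it does not check that `{` is closed by `}` rather than `]`, and it makes no claim about allocation behaviour.

```haskell
import Data.List (foldl')

-- Lexical mode of the scanner (hypothetical names, JSON-like syntax assumed).
data Mode = Outside | InString | InEscape
  deriving (Eq, Show)

-- Scanner state: current mode, nesting depth, and a flag set once an
-- unmatched closing bracket has been seen. Fields are strict so the fold
-- does not build up thunks.
data Scan = Scan !Mode !Int !Bool
  deriving (Eq, Show)

-- One transition per input character.
step :: Scan -> Char -> Scan
step s@(Scan _ _ True) _ = s                 -- already failed; stay failed
step (Scan Outside d _) c
  | c == '"'             = Scan InString d False
  | c == '{' || c == '[' = Scan Outside (d + 1) False
  | c == '}' || c == ']' =
      if d > 0 then Scan Outside (d - 1) False
               else Scan Outside d True      -- close without a matching open
  | otherwise            = Scan Outside d False
step (Scan InString d _) c
  | c == '\\' = Scan InEscape d False        -- next character is escaped
  | c == '"'  = Scan Outside d False         -- end of string literal
  | otherwise = Scan InString d False        -- brackets in strings are ignored
step (Scan InEscape d _) _ = Scan InString d False

-- True when every bracket is closed, no string is left open,
-- and no unmatched closing bracket was seen.
balanced :: String -> Bool
balanced input =
  case foldl' step (Scan Outside 0 False) input of
    Scan Outside 0 False -> True
    _                    -> False

main :: IO ()
main = mapM_ (print . balanced)
  [ "{\"a\": \"]}\"}"   -- brackets inside a string literal
  , "[1, 2"             -- unclosed bracket
  ]
```

The second case shows the not-well-formed input from point (a); the first shows why the string and escape modes from point (b) have to be carried through the scan.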
My thinking was actually that it could be used to reduce the allocations, since everything (AFAIR, it's been a while) would basically just be linear data structures and indexes.
This'd be pretty trivial to do in a C-like language now we've got the DFA tables built, and I may resort to that at some point if we can't work out how to avoid the allocations in Haskell-land, but I'd quite like to be able to solve problems of this form without unnecessarily resorting to C.
Well, there's always https://hackage.haskell.org/package/inline-c :)

Regards,