
Interesting, thanks for the link. However we're trying to do _even less_
than that - this is just the "scan the document to produce the Elias-Fano
encoding" step, except without needing to keep hold of it for later. It's
not quite as trivial as that paper makes out, as (a) it doesn't mention
the possibility that the documents might not be well-formed, and (b) it
doesn't really talk about dealing with the special delimiting characters
within string literals, for which you need to drag along a bunch of state
while you're scanning.
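To give a flavour of the string-literal bookkeeping I mean, here's a very
cut-down, untested sketch (nowhere near a real validator - it only tracks
an in-string flag, an escape flag and the nesting depth, and ignores
numbers, keywords and UTF-8 entirely):

import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Illustrative state only: an in-string flag, an "escape pending" flag,
-- and the {}/[] nesting depth.
data ScanState = ScanState
  { inString :: !Bool
  , escaped  :: !Bool
  , depth    :: !Int
  }

scanByte :: ScanState -> Word8 -> ScanState
scanByte st w
  | escaped st  = st { escaped = False }
  | inString st = case w of
      0x5c -> st { escaped = True }       -- backslash starts an escape
      0x22 -> st { inString = False }     -- unescaped quote ends the literal
      _    -> st
  | otherwise   = case w of
      0x22 -> st { inString = True }      -- quote opens a literal
      0x7b -> st { depth = depth st + 1 } -- '{'
      0x5b -> st { depth = depth st + 1 } -- '['
      0x7d -> st { depth = depth st - 1 } -- '}'
      0x5d -> st { depth = depth st - 1 } -- ']'
      _    -> st

-- Rough structural check only: balanced nesting, no unterminated string.
looksBalanced :: BL.ByteString -> Bool
looksBalanced bs =
  let end = BL.foldl' scanByte (ScanState False False 0) bs
  in  not (inString end) && depth end == 0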
This'd be pretty trivial to do in a C-like language now that we've got the
DFA tables built, and I may resort to that at some point if we can't work
out how to avoid the allocations in Haskell-land, but I'd quite like to be
able to solve problems of this form without unnecessarily resorting to C.
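For concreteness, the shape I'm hoping for in Haskell is roughly the
following (untested sketch; dfaStep is just a stand-in for a lookup into
the real DFA tables): walk the strict chunks of the lazy ByteString with a
tail-recursive worker whose state is a single Int, so that GHC can
hopefully unbox everything and the inner loop allocates nothing per byte.

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Unsafe as BU
import Data.Word (Word8)

-- Stand-in for the real transition-table lookup.
dfaStep :: Int -> Word8 -> Int
dfaStep s _ = s

-- Run the DFA over the lazy ByteString chunk by chunk; the Int state
-- and the chunk index should stay unboxed in the inner loop.
runDFA :: BL.ByteString -> Int
runDFA = goChunks 0 . BL.toChunks
  where
    goChunks !s []       = s
    goChunks !s (c : cs) = goChunks (goBytes s 0 (BS.length c) c) cs

    goBytes !s !i !n c
      | i >= n    = s
      | otherwise = goBytes (dfaStep s (BU.unsafeIndex c i)) (i + 1) n c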
Cheers,
On 11 May 2017 at 17:30, Bardur Arantsson wrote:
On 2017-05-11 18:12, David Turner wrote:
Dear Café,
We have a stream of ~900-byte JSON documents (lazy ByteStrings) that we would like to validate (NB not parse) as quickly as possible. We just need to check that the bytes are a syntactically valid JSON document and then pass them elsewhere; we do not care what the content is. The documents are encoded in UTF-8 and have no extraneous whitespace.
No particular recommendations, but you might want to look into semi-indexing [1] as a strategy. It looks plausible that it would be possible to do that without a lot of allocation; see the paper for details. (I think there's also a demo implementation in Python on GH.)
[1] http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf