Interesting, thanks for the link. However we're trying to do _even less_ than that - this is just the "scan the document to produce the Elias-Fano encoding" step, except without needing to keep a hold of it for later. It's not quite as trivial as that paper makes out as (a) it doesn't mention the possibility that the documents might not be well-formed, and (b) it doesn't really talk about dealing with the special delimiting characters within string literals, for which you need to drag along a bunch of state while you're scanning.

This'd be pretty trivial to do in a C-like language now we've got the DFA tables built, and I may resort to that at some point if we can't work out how to avoid the allocations in Haskell-land, but I'd quite like to be able to solve problems of this form without unnecessarily resorting to C.

Cheers,

On 11 May 2017 at 17:30, Bardur Arantsson <spam@scientician.net> wrote:
On 2017-05-11 18:12, David Turner wrote:
> Dear Café,
>
> We have a stream of ~900byte JSON documents (lazy ByteStrings) that we
> would like to validate (NB not parse) as quickly as possible. We just
> need to check that the bytes are a syntactically valid JSON document and
> then pass them elsewhere; we do not care what the content is. The
> documents are encoded in UTF-8 and have no extraneous whitespace.
>

No particular recommendations, but you might want to look into
semi-indexing[1] as a strategy. It looks plausible that it would be
possible to do that without a lot of allocation; see the paper for
details. (I there's also a demo implementation in Python on GH.)

[1] http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.