Is it possible to have constant-space JSON decoding?

oleg＠okmij.org

5 Dec 2012 5 Dec '12

5:13 a.m.

I am doing, for several months, constant-space processing of large XML files using iteratees. The file contains many XML elements (which are a bit complex than a number). An element can be processed independently. After the parser finished with one element, and dumped the related data, the processing of the next element starts anew, so to speak. No significant state is accumulated for the overall parsing sans the counters of processed and bad elements, for statistics. XML is somewhat like JSON, only more complex: an XML parser has to deal with namespaces, parsed entities, CDATA sections and the other interesting stuff. Therefore, I'm quite sure there should not be fundamental problems in constant-space parsing of JSON. BTW, the parser itself is described there http://okmij.org/ftp/Streams.html#xml

Show replies by date

Johan Tibell

5 Dec 5 Dec

5:37 a.m.

New subject: Is it possible to have constant-space JSON decoding?

Hi Oleg, On Tue, Dec 4, 2012 at 9:13 PM, wrote:

...

I am doing, for several months, constant-space processing of large XML files using iteratees. The file contains many XML elements (which are a bit complex than a number). An element can be processed independently. After the parser finished with one element, and dumped the related data, the processing of the next element starts anew, so to speak. No significant state is accumulated for the overall parsing sans the counters of processed and bad elements, for statistics. XML is somewhat like JSON, only more complex: an XML parser has to deal with namespaces, parsed entities, CDATA sections and the other interesting stuff. Therefore, I'm quite sure there should not be fundamental problems in constant-space parsing of JSON.

BTW, the parser itself is described there http://okmij.org/ftp/Streams.html#xml

It certainly is possible (using a SAX style parser). What you can't have (I think) is a function: decode :: FromJSON a => ByteString -> Maybe a and constant-memory parsing at the same time. The return type here says that we will return Nothing if parsing fails. We can only do so after looking at the whole input (otherwise how would we know if it's malformed). The use cases aeson was designed for (which I bet is the majority use case) is parsing smaller messages sent over the network (i.e. in web service APIs) so this is the only mode of parsing it supplies. -- Johan

Jason Dagit

5:48 a.m.

New subject: Is it possible to have constant-space JSON decoding?

On Tue, Dec 4, 2012 at 9:37 PM, Johan Tibell wrote:

...

Hi Oleg,

On Tue, Dec 4, 2012 at 9:13 PM, wrote:

...
I am doing, for several months, constant-space processing of large XML files using iteratees. The file contains many XML elements (which are a bit complex than a number). An element can be processed independently. After the parser finished with one element, and dumped the related data, the processing of the next element starts anew, so to speak. No significant state is accumulated for the overall parsing sans the counters of processed and bad elements, for statistics. XML is somewhat like JSON, only more complex: an XML parser has to deal with namespaces, parsed entities, CDATA sections and the other interesting stuff. Therefore, I'm quite sure there should not be fundamental problems in constant-space parsing of JSON.

BTW, the parser itself is described there http://okmij.org/ftp/Streams.html#xml

It certainly is possible (using a SAX style parser). What you can't have (I think) is a function:

decode :: FromJSON a => ByteString -> Maybe a

and constant-memory parsing at the same time. The return type here says that we will return Nothing if parsing fails. We can only do so after looking at the whole input (otherwise how would we know if it's malformed).

I thought it was possible to get around this with lazy patterns such "Wadler's force function" [1]? (untested code) force y = let Just x = y in Just x lazyDecode :: FromJSON a => ByteString -> Maybe a lazyDecode = force . decode [1] http://www.haskell.org/haskellwiki/Maintaining_laziness

Albert Y. C. Lai

7 Dec 7 Dec

11:06 p.m.

New subject: Is it possible to have constant-space JSON decoding?

On 12-12-05 12:48 AM, Jason Dagit wrote:

...

I thought it was possible to get around this with lazy patterns such "Wadler's force function" [1]?

(untested code)

force y = let Just x = y in Just x

lazyDecode :: FromJSON a => ByteString -> Maybe a lazyDecode = force . decode

This says, the type is Maybe, but the value is always Just. If there will be a parse error, you will get an async imprecise exception, not Nothing, at the later time when you look closer. This respects the letter, but not the point, of the Maybe type. If you are to do this (I have no moral, I hold nothing against doing this, the async imprecise exception will give you enough grief), you may as well skip the Just wrapping and directly go ByteString -> a. The point of the Maybe type is to communicate parse errors by a less async, less imprecise way.

Malcolm Wallace

9:57 a.m.

New subject: Is it possible to have constant-space JSON decoding?

See also the incremental XML parser in HaXml, described in "Partial parsing: combining choice with commitment", IFL 2006. It has constant space usage (for some patterns of usage), even with extremely large inputs. http://www.google.co.uk/url?sa=t&rct=j&q=malcolm+wallace+partial+parsing&source=web&cd=2&ved=0CEEQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.135.7512%26rep%3Drep1%26type%3Dpdf&ei=Db3BUNmiOsfS4QTAkoDYAw&usg=AFQjCNHHywUCvaFv8eBoQ-x9jj4GOMHo2w On 5 Dec 2012, at 05:37, Johan Tibell wrote:

...

Hi Oleg,

On Tue, Dec 4, 2012 at 9:13 PM, wrote:

...
I am doing, for several months, constant-space processing of large XML files using iteratees. The file contains many XML elements (which are a bit complex than a number). An element can be processed independently. After the parser finished with one element, and dumped the related data, the processing of the next element starts anew, so to speak. No significant state is accumulated for the overall parsing sans the counters of processed and bad elements, for statistics. XML is somewhat like JSON, only more complex: an XML parser has to deal with namespaces, parsed entities, CDATA sections and the other interesting stuff. Therefore, I'm quite sure there should not be fundamental problems in constant-space parsing of JSON.

BTW, the parser itself is described there http://okmij.org/ftp/Streams.html#xml

It certainly is possible (using a SAX style parser). What you can't have (I think) is a function:

decode :: FromJSON a => ByteString -> Maybe a

and constant-memory parsing at the same time. The return type here says that we will return Nothing if parsing fails. We can only do so after looking at the whole input (otherwise how would we know if it's malformed).

The use cases aeson was designed for (which I bet is the majority use case) is parsing smaller messages sent over the network (i.e. in web service APIs) so this is the only mode of parsing it supplies.

-- Johan

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

oleg＠okmij.org

15 Dec 15 Dec

3:02 p.m.

New subject: Is it possible to have constant-space JSON decoding?

Johan Tibell posed an interesting problem of incremental XML parsing while still detecting and reporting ill-formedness errors.

...

What you can't have (I think) is a function:

decode :: FromJSON a => ByteString -> Maybe a

and constant-memory parsing at the same time. The return type here says that we will return Nothing if parsing fails. We can only do so after looking at the whole input (otherwise how would we know if it's malformed).

The problem is very common: suppose we receive an XML document over a network (e.g., in an HTTP stream). Network connections are inherently unreliable and can be dropped at any time (e.g., because someone tripped over a power cord). The XML document therefore can come truncated, for example, missing the end tag for the root element. According to the XML Recommendations, such document is ill-formed, and hence does not have an Infoset (In contrast, invalid but well-formed documents do have the Infoset). Strictly speaking, we should not be processing an XML document until we verified that it is well-formed, that is, until we parsed it at all and have checked that all end tags are in place. It seems we can't do the incremental XML processing in principle. I should mention first that sometimes people avoid such a strict interpretation. If we process telemetry data encoded in XML, we may consider a document meaningful even if the root end tag is missing. We process as far as we can. Even if we take the strict interpretation, it is still possible to handle a document incrementally so long as the processing is functional or the side effects can be backed out (e.g., in a transaction). This message illustrates exactly such an incremental processing that always detects ill-formed XML, and, optionally, invalid XML. It is possible after all to detect ill-formedness errors and process the document without loading it all in memory first -- using as little memory as needed to hold the state of the processor -- just a short string in our example. Our running example is an XML document representing a finite map: a collection of key-value pairs where key is an integer: <map> <kv><key>1</key><value>v1</value></kv> <kv><key>2</key><value>v2</value></kv> <kv><key>bad</key><value>v3</value></kv> The above document is both ill-formed (missing the end tag) and invalid (one key is bad: non-read-able). We would like to write a lookup function for a key in such an encoded map. We should report an ill-formedness error always. The reporting of validation errors may vary. The function xml_lookup :: Monad m => Key -> Iteratee Char m (Maybe Value) xml_lookup key = id .| xml_enum default_handlers .| xml_normalize .| kv_enum (lkup key) always reports the validation errors. The function is built by composition from smaller, separately developed and tested pipeline components: parsing of a document to the XMLStream, normalization, converting the XMLStream to a stream of (Key,Value) pairs and finally searching the stream. A similar function that replaces kv_enum with kv_enum_pterm terminates the (Key,Value) conversion as soon as its client iteratee finished. In that case if the lookup succeeds before we encounter the bad key, no error is reported. Ill-formedness errors are raised always. The code http://okmij.org/ftp/Haskell/Iteratee/XMLookup.hs shows the examples of both methods of validation error reporting. This code also illustrates the library of parsing combinators, which represent the element structure (`DTD').

4581

Age (days ago)

4591

Last active (days ago)

List overview

Download

5 comments

5 participants

participants (5)

Albert Y. C. Lai
Jason Dagit
Johan Tibell
Malcolm Wallace
oleg＠okmij.org