Is it possible to have constant-space JSON decoding?

Hi,

I'm trying to parse a rather simple but big JSON message, but it turns out that memory consumption is a problem, and I'm not sure what the actual cause is.

Let's say we have a simple JSON message: an array of 5 million numbers. I would like to parse this in constant space, such that if I only need the last element, overall memory usage is low (yes, unrealistic use, but please bear with me for a moment).

Using aeson, I thought the following program would be nicely behaved:
import Data.Aeson
import qualified Data.Attoparsec.ByteString.Lazy as AL
import qualified Data.ByteString.Lazy as L

main = do
  r <- L.readFile "numbers"
  case AL.parse json r :: AL.Result Value of
    AL.Fail _ context errs -> do
      print context
      print errs
    AL.Done _ d ->
      case fromJSON d :: Result [Value] of
        Error x   -> putStrLn x
        Success d -> print $ last d
However, this uses (according to +RTS -s) ~1150 MB of memory. I've tried switching from json to json', but while that uses slightly less memory (~1020 MB), it clearly can't be fully lazy, since it forces conversion to actual types.

Looking at the implementation of "FromJSON [a]", it seems we could optimise the code by not forcing to a list. The new (partial) version does:
    AL.Done _ d ->
      case d of
        Array v -> print $ V.last v
And this indeed reduces the memory, when using json', to about ~700MB. Better, but still a lot.

It seems that the Array constructor holds a vector, and this results in too much strictness?

Looking at the memory profiles (with "json" and "Array"), things are quite interesting - lots of VOID, very small USE, all generated from Data.Aeson.Parser.Internal:array. Using -hd, we have a reasonably equal split between various attoparsec combinators, Data.Aeson.Parser.Internal expressions, etc.

So, am I doing something wrong, or is it simply not feasible to get constant-space JSON decoding?

Using the 'json' library instead of 'aeson' is no better, since it wants the input as a String, which consumes even more memory (and, when compiled with -O2, dies with an out-of-stack error even for a 64MB stack).

thanks,
iustin

Aeson doesn't have an incremental parser so it'll be
difficult/impossible to do what you want. I guess you want an
event-based JSON parser, such as yajl [1]. I've never used this
library, though.
Cheers,
[1] http://hackage.haskell.org/package/yajl-0.3.0.5
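For the array-of-numbers example above, an event-style, constant-space pass can also be hand-rolled with attoparsec's lazy interface, without yajl. A minimal sketch, assuming the input really is just one JSON array of numbers; `foldArray` and `lastElement` are made-up names here, not library functions:

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Attoparsec.ByteString.Lazy as AL
import qualified Data.ByteString.Lazy.Char8 as L8

-- Strict left fold over the elements of a JSON array of numbers,
-- without ever materialising the array itself.
foldArray :: (b -> Double -> b) -> b -> A.Parser b
foldArray f z0 = do
  A.skipSpace
  _ <- A.char '['
  A.skipSpace
  mc <- A.peekChar
  case mc of
    Just ']' -> A.anyChar >> pure z0      -- empty array
    _        -> go z0
  where
    go !z = do
      n <- A.double
      A.skipSpace
      c <- A.anyChar                      -- ',' between elements, ']' at the end
      let !z' = f z n                     -- force the accumulator at each step
      case c of
        ',' -> A.skipSpace >> go z'
        ']' -> pure z'
        _   -> fail "expected ',' or ']'"

-- Keeping only the last element: the accumulator is a single Maybe,
-- so the already-consumed prefix of the lazy ByteString can be collected.
lastElement :: L8.ByteString -> Maybe Double
lastElement s = case AL.parse (foldArray (\_ n -> Just n) Nothing) s of
  AL.Done _ r -> r
  _           -> Nothing
```

Run against the lazy ByteString from L.readFile, this should stay in roughly constant space, since neither a Value tree nor the whole input is retained; the obvious cost is that the parser is specific to this one document shape.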
On Tue, Dec 4, 2012 at 12:11 PM, Iustin Pop wrote:
[...]

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Felipe.

On Tue, Dec 04, 2012 at 12:23:19PM -0200, Felipe Almeida Lessa wrote:
> Aeson doesn't have an incremental parser so it'll be difficult/impossible to do what you want. I guess you want an event-based JSON parser, such as yajl [1]. I've never used this library, though.
Ah, I see. Thanks, I wasn't aware of that library.

So it seems that using either 'aeson' or 'json', we should be prepared to pay the full cost of the input message (string/bytestring) plus the cost of the converted data structures.

thanks!
iustin

Aeson is used for the very common use case of short messages that need to be
parsed as quickly as possible into a static structure. A lot of things are
sacrificed to make this work, such as incremental parsing and good error
messages. It works great for web APIs like Twitter's.
I didn't even know people used JSON to store millions of integers. It
sounds like fun.
- Clark
On Tue, Dec 4, 2012 at 9:38 AM, Iustin Pop wrote:
[...]

On Tue, Dec 04, 2012 at 09:47:52AM -0500, Clark Gaebel wrote:
> Aeson is used for the very common usecase of short messages that need to be parsed as quickly as possible into a static structure. A lot of things are sacrificed to make this work, such as incremental parsing and good error messages. It works great for web APIs like twitter's.
I see, good to know.
> I didn't even know people used JSON to store millions of integers. It sounds like fun.
Actually, that's not the actual use case :), this was just an example to show memory use with trivial data structures (where strictness/laziness is easier to reason about).

In my case, I have a reasonably-sized message of complex objects, which results in the same memory profile: the cost of the input message (as String/ByteString) plus the cost of the converted objects.

What bothers me is that I don't seem to be able to at least remove the cost of the input data after parsing, due to the non-strict types I convert to.

thanks,
iustin
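One partial remedy for that last point can be sketched as follows, assuming the decoded type has an NFData instance (aeson does provide one for Value): fully force the decoded structure once, so no thunk keeps the input ByteString alive and its chunks become garbage. `decodeForced` is a made-up helper name; note this only reduces retention after parsing, not the peak while the Value is being built.

```haskell
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Data.Aeson (Value, decode, encode)
import qualified Data.ByteString.Lazy as L

-- Decode and fully force in one step; returns only once the whole
-- Value is in normal form, so the caller holds no thunks into the input.
decodeForced :: L.ByteString -> IO (Maybe Value)
decodeForced bs = evaluate (force (decode bs))
```

Usage would be along the lines of `r <- L.readFile "message.json"; mv <- decodeForced r` (the file name is made up); after `decodeForced` returns and `r` goes out of scope, the input chunks can be collected even while the Value is still in use.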

Clark Gaebel writes:
> I didn't even know people used JSON to store millions of integers. It sounds like fun.
Actually, JSON is quite convenient if you need a standardized common interchange format between Python-, Ruby-, JS- et al. based components, as it maps directly to a common subset of the primitive data structures available in those languages (i.e. bools, strings, numbers, arrays, objects), and very efficient JSON decoders are available by now (maybe even as part of the respective standard library).

So I'm actually struggling myself to find a way to get a large JSON text parsed in Haskell with a memory footprint comparable to e.g. Python's.

As for generating large JSON documents, I've found in 'json-builder' a more memory-efficient alternative to 'aeson'.

Iustin Pop writes:

> Let's say we have a simple JSON message: an array of 5 million numbers. I would like to parse this in constant space, such that if I only need the last element, overall memory usage is low (yes, unrealistic use, but please bear with me for a moment).
>
> Using aeson, I thought the following program will be nicely-behaved:

part of the problem is that aeson builds an intermediate JSON parse-tree, which has quite an overhead for representing a list of numbers on the heap, as each number requires multiple heap objects (see also [1]). This is an area where e.g. Python has a significantly smaller footprint (mostly due to a more efficient heap representation).

[...]

> It seems that the Array constructor holds a vector, and this results in too much strictness?

btw, using a list on the other hand would add an overhead of 2 words (due to the ':' constructor) for representing each JSON array element in the parse-tree; that's probably why aeson uses vectors instead of lists.

[...]

cheers,
hvr

[1]: https://github.com/bos/aeson/issues/22
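To illustrate the footprint point: when the payload is known to be homogeneous numbers, parsing straight into an unboxed vector skips the per-element boxing in the final structure (8 bytes per Double rather than several heap objects). A sketch that bypasses aeson's Value entirely; `numbers` is a hypothetical parser, and an intermediate list still exists transiently while `fromList` runs:

```haskell
import Data.ByteString.Char8 (pack)
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Vector.Unboxed as U

-- Parse a JSON array of numbers directly into an unboxed vector.
-- Only the final, flat representation survives parsing.
numbers :: A.Parser (U.Vector Double)
numbers = do
  A.skipSpace
  _  <- A.char '['
  xs <- (A.skipSpace *> A.double <* A.skipSpace) `A.sepBy` A.char ','
  _  <- A.char ']'
  pure (U.fromList xs)

-- e.g. A.parseOnly numbers (pack "[1, 2.5, 3]")
```

So this trades generality (it only handles one document shape) for a compact result; the parse-tree overhead discussed above disappears, though peak memory during parsing is still not constant.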

On Tue, Dec 04, 2012 at 03:58:24PM +0100, Herbert Valerio Riedel wrote:
> Iustin Pop writes:
>
> [...]
>
>> Let's say we have a simple JSON message: an array of 5 million numbers. I would like to parse this in constant space, such that if I only need the last element, overall memory usage is low (yes, unrealistic use, but please bear with me for a moment).
>>
>> Using aeson, I thought the following program will be nicely-behaved:
>
> part of the problem is that aeson builds an intermediate JSON parse-tree which has quite an overhead for representing a list of numbers on the heap as each numbers requires multiple heap objects (see also [1]). This is an area where e.g. Python has a significantly smaller footprint (mostly due to a more efficient heap representation).

Ah, I see. Thanks for the link; so that's where the 'S' constructor in the -hd output was coming from. And indeed, I was surprised as well that Python has a more efficient representation for this.

[...]

>> It seems that the Array constructor holds a vector, and this results in too much strictness?
>
> btw, using a list on the other hand would add an overhead of 2 words (due to the ':' constructor) for representing each JSON array element in the parse-tree, that's probably why aeson uses vectors instead of lists.

Ack.

thanks,
iustin
participants (4):

- Clark Gaebel
- Felipe Almeida Lessa
- Herbert Valerio Riedel
- Iustin Pop