
I am getting some weird memory usage out of this program:

module Main where

import Data.Binary
import Data.List (foldl')

main = do
  let sum' = foldl' (+) 0
  let list :: [Int] = decode $ encode $ ([1..] :: [Int])
  print $ sum' list
  print "done"

It goes up to 500M and down to 17M on Windows. It's built with GHC 6.6.1 and the latest Data.Binary. Any ideas what could be causing the memory usage to jump around so much?

Thanks, Anatoly

Does it terminate? Looks like you are summing all the natural numbers. On a Turing machine it should run forever; on a real computer it should run out of memory. Unless I am missing something obvious :-)

On Wed, Oct 03, 2007 at 01:22:25AM +0200, Roel van Dijk wrote:
Does it terminate?
Looks like you are summing all the natural numbers. On a Turing machine it should run forever; on a real computer it should run out of memory. Unless I am missing something obvious :-)
There are only about 4 billion distinct values of type Int, 2 billion of which are positive. Integer is required for bigger values.

Stefan
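A small check of that wraparound, for concreteness (this snippet is mine, not from the thread; it assumes the 32-bit Int of the poster's GHC 6.6.1/Windows setup, where GHC's Int arithmetic wraps in two's complement):

module IntWrap where

-- Int overflow wraps silently, so a running sum of [1..]
-- stays within machine-word range instead of growing.
main :: IO ()
main = do
  print (maxBound :: Int)          -- 2147483647 on a 32-bit build
  print ((maxBound :: Int) + 1)    -- wraps around to -2147483648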

On Tue, Oct 02, 2007 at 04:08:01PM -0700, Anatoly Yakovenko wrote:
I am getting some weird memory usage out of this program:
module Main where
import Data.Binary
import Data.List (foldl')

main = do
  let sum' = foldl' (+) 0
  let list :: [Int] = decode $ encode $ ([1..] :: [Int])
  print $ sum' list
  print "done"
it goes up to 500M and down to 17M on Windows. It's built with GHC 6.6.1 and the latest Data.Binary
Any ideas what could be causing the memory usage to jump around so much?
Only 500M? encode for lists is strict; I would have expected around 80GB of usage... What does -ddump-simpl-stats say?

Stefan

Program1:
module Main where
import Data.Binary
import Data.List(foldl')
main = do
  let sum' = foldl' (+) 0
  let list :: [Int] = decode $ encode $ ([1..] :: [Int])
  print $ sum' list
  print "done"
vs
Program2:
module Main where
import Data.Binary
import Data.List(foldl')
main = do
  let sum' = foldl' (+) 0
  let list :: [Int] = [1..]
  print $ sum' list
  print "done"
Neither program is expected to terminate. The point of these examples is to demonstrate that Data.Binary's encode and decode have some strange memory allocation patterns.
If you run Program1, it will run forever, but its memory usage on my machine goes up to 500M, then back down to 17M, then back up to 500M, then back down to 17M... repeatedly. I don't think this has anything to do with running out of space in a 32-bit integer.
Program2, on the other hand, runs in constant memory at around 2M.
Anatoly

aeyakovenko:
neither program is expected to terminate. The point of these examples is to demonstrate that Data.Binary encode and decode have some strange memory allocation patterns.
The encode instance for lists is fairly strict:

instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get   = do n <- get :: Get Int
               replicateM n get

This is ok, since typically you aren't serialising infinite structures. Use a newtype, and a lazier instance, if you need to do this.

-- Don
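For reference, a minimal sketch of the kind of lazier newtype instance Don suggests (the LazyList name and tag values are my own, not from the library; it uses the same element-tagging trick as Anatoly's Foo instance later in the thread):

module LazyListSketch where

import Data.Binary
import Data.Word (Word8)

newtype LazyList a = LazyList [a]

-- Tag each cons cell with 1 and the end of the list with 0, so
-- encoding never needs the length (and hence the whole list) up front.
instance Binary a => Binary (LazyList a) where
    put (LazyList [])     = put (0 :: Word8)
    put (LazyList (x:xs)) = put (1 :: Word8) >> put x >> put (LazyList xs)
    get = do
        tag <- get :: Get Word8
        case tag of
            0 -> return (LazyList [])
            _ -> do x <- get
                    LazyList xs <- get
                    return (LazyList (x : xs))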

On 10/2/07, Don Stewart
The encode instance for lists is fairly strict:
instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get   = do n <- get :: Get Int
               replicateM n get
This is ok, since typically you aren't serialising infinite structures.
Hmm, this doesn't make sense to me. The usage goes up to 500M, then down, then back up, then back down, so it isn't simply running out of memory because (length l) forces evaluation of the entire list.
Use a newtype, and a lazier instance, if you need to do this.
Thanks for the tip. This runs at a constant 4M:

module Main where

import Data.Binary
import Data.List (foldl')
import Data.Word (Word8)

data Foo = Foo Int Foo | Null

instance Binary Foo where
    put (Foo i f) = do
        put (0 :: Word8)
        put i
        put f
    put Null = put (1 :: Word8)
    get = do
        t <- get :: Get Word8
        case t of
            0 -> do
                i <- get
                f <- get
                return (Foo i f)
            1 -> return Null

sumFoo zz Null        = zz
sumFoo zz (Foo ii ff) = sumFoo (zz + ii) ff

fooBar i = Foo i (fooBar (i + 1))

main = do
    print $ sumFoo 0 $ decode $ encode $ fooBar 1
    print "done"

On 10/2/07, Don Stewart
This is ok, since typically you aren't serialising infinite structures.
Use a newtype, and a lazier instance, if you need to do this.
Maybe something like this (WARNING: ugly code, as in "not elegant", follows):

import Control.Monad (replicateM, when)
import Data.Binary
import Data.Word (Word8)

newtype List a = List [a]

split255 :: [a] -> (Word8, [a], [a])
split255 = s 0
  where
    s 255 xs   = (255, [], xs)
    s n (x:xs) = let (n', ys, zs) = s (n+1) xs in (n', x:ys, zs)
    s n []     = (n, [], [])

instance Binary a => Binary (List a) where
    put (List l) =
        let (n, xs, ys) = split255 l
        in do putWord8 n
              mapM_ put xs
              when (n == 255) (put $ List ys)
    get = do
        n  <- getWord8
        xs <- replicateM (fromEnum n) get
        if n == 255
            then get >>= \(List ys) -> return (List $ xs ++ ys)
            else return (List xs)

It uses chunks of 255 elements and so doesn't traverse the whole list before starting to output something. OTOH, there's some data overhead (1 byte every 255 elements). Seems to run your example fine and in constant memory.

HTH,

-- Felipe.
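A driver for the original test might look like this (my own hypothetical snippet, assuming Felipe's List instance above is in scope; like Program1 it runs forever, but in constant memory):

import Data.Binary
import Data.List (foldl')

main :: IO ()
main = do
    -- Wrap the infinite list in the newtype on both sides of the round trip.
    let List list = decode (encode (List [1..])) :: List Int
    print (foldl' (+) 0 list)
    print "done"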

Maybe what you are observing is that the operational semantics of undefined is undefined. The program can halt, run forever, use no memory, use all the memory. Although I doubt what GHC does with this code is a random process, I don't think it's too meaningful to ask about the space usage patterns of a program returning bottom.

Anatoly Yakovenko wrote:
neither program is expected to terminate. The point of these examples is to demonstrate that Data.Binary encode and decode have some strange memory allocation patterns.
If you run Program1, it will run forever, but its memory usage on my machine goes to 500M then back down to 17M then back up to 500M then back down to 17M... repeatedly. I don't think this has anything to do with running out of space in a 32 bit integer.
Program2 on the other hand runs at constant memory at around 2M.
Anatoly

Servers never terminate. Pretend that I have a server that reads a list encoded with Data.Binary from a socket, sums it up, and returns the current sum. I would expect it to run in constant memory, never terminate, and do useful work.
Which is basically the problem that I am facing right now: my program seems to grow randomly in memory use when marshaling large data types encoded using Data.Binary.
On 10/2/07, Dan Weston
Maybe what you are observing is that the operational semantics of undefined is undefined. The program can halt, run forever, use no memory, use all the memory.
Although I doubt what GHC does with this code is a random process, I don't think it's too meaningful to ask about the space usage patterns of a program returning bottom.

aeyakovenko:
servers never terminate. Pretend that I have a server that reads a list encoded with Data.Binary from a socket, sums it up, and returns the current sum. I would expect it to run in constant memory, never terminate, and do useful work.
which is basically the problem that I am facing right now: my program seems to grow randomly in memory use when marshaling large data types encoded using Data.Binary.
If it's specifically the list instance, where we currently trade laziness for efficiency of encoding (which may or may not be the right thing), I'd suggest a fully lazy encoding instance?

-- Don

If it's specifically the list instance, where we currently trade laziness for efficiency of encoding (which may or may not be the right thing), I'd suggest a fully lazy encoding instance?
It's not really a list, it's more of a tree that has shared nodes, something like this:

    A
   / \
  B   C
   \ /
    D
   / \
  E   F

I suspect that maybe after encode/decode I end up with something like

     A
    / \
   B   C
  /     \
 D       D
/ \     / \
E   F  E   F

On Tuesday 02 October 2007 19:51:47 Anatoly Yakovenko wrote:
I suspect that maybe after encode/decode I end up with something like

     A
    / \
   B   C
  /     \
 D       D
/ \     / \
E   F  E   F
That is correct, binary doesn't attempt to share substructures. If you'd like to do this, you'll need to do it by hand.

Cheers,
Spencer Janssen

Spencer Janssen wrote:
That is correct, binary doesn't attempt to share substructures. If you'd like to do this, you'll need to do it by hand.
...and indeed it can't be done, except by the naive brute-force method of comparing every subtree, possibly optimised by cryptographically hashing a representation of every subtree, since sharing isn't an observable property. Of course, hashing doesn't actually observe the real sharing present; rather, it computes maximal sharing. There are some applications where this could be a worthwhile win.

Jules

PS: Well, except unsafePtrEquality, but I don't really want to go there...

...and indeed it can't be done, except by the naive brute-force method of comparing every subtree, possibly optimized by cryptographically hashing a representation of every subtree, since sharing isn't an observable property.
I was thinking that instead of holding a reference to a node, each node would just hold an index into an array of nodes. Traversal would take an extra step, but it should fix the problem with encode/decode.
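A minimal sketch of that representation (the Graph/Node names and the Map-based node table are my own illustration, not code from the thread): nodes refer to their children by index, so serialising the table writes each shared node exactly once and the sharing survives the round trip.

import Data.Binary
import qualified Data.Map as M

type NodeId = Int

-- A node stores the indices of its children instead of direct
-- references, making the sharing explicit in the data itself.
data Node = Node Int [NodeId]

instance Binary Node where
    put (Node v cs) = put v >> put cs
    get = do v  <- get
             cs <- get
             return (Node v cs)

-- The whole structure is a root index plus the node table.
data Graph = Graph NodeId (M.Map NodeId Node)

instance Binary Graph where
    put (Graph r ns) = put r >> put (M.toList ns)
    get = do r  <- get
             ns <- get
             return (Graph r (M.fromList ns))

-- The shared tree from the example: B and C both point at index 3,
-- so D (and its children E, F) is stored exactly once, and
-- decode (encode shared) :: Graph keeps that single copy.
shared :: Graph
shared = Graph 0 (M.fromList
    [ (0, Node 1 [1, 2])  -- A
    , (1, Node 2 [3])     -- B
    , (2, Node 3 [3])     -- C, sharing D with B
    , (3, Node 4 [4, 5])  -- D
    , (4, Node 5 [])      -- E
    , (5, Node 6 [])      -- F
    ])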

On 10/4/07, Jules Bean
...and indeed it can't be done, except by the naive brute-force method of comparing every subtree, possibly optimised by cryptographically hashing a representation of every subtree, since sharing isn't an observable property.
At least one Prolog implementation (I forget which, I'm sorry) had a [de]serialisation library which used a hash-consing approach. Basically, it did its serialisation using a post-order traversal and emitted references to previous values when the same value had already been emitted. Not rocket science. Actually, I've heard a Prolog guy - Bart Demoen - talk about doing pretty much this during GC to improve sharing.

cheers,
T.

--
Thomas Conway
drtomc@gmail.com

Silence is the perfectest herald of joy: I were but little happy, if I could say how much.
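A rough sketch of that post-order, emit-references approach (entirely my own illustration; a real implementation would hash subtrees rather than lean on a derived Ord, as Jules suggested above):

import qualified Data.Map as M

data Tree = Leaf Int | Branch Tree Tree
    deriving (Eq, Ord)

-- Serialised form: a table of definitions whose children are
-- indices of earlier entries, plus the index of the root.
data Def = DLeaf Int | DBranch Int Int

-- Post-order traversal with a memo table: a subtree that was
-- already emitted is replaced by a reference to its index.
flatten :: Tree -> ([Def], Int)
flatten t0 = let (root, (_, defs)) = go t0 (M.empty, [])
             in (reverse defs, root)
  where
    go t st@(seen, _)
        | Just i <- M.lookup t seen = (i, st)
    go t@(Leaf n) st = emit t (DLeaf n) st
    go t@(Branch l r) st0 =
        let (il, st1) = go l st0
            (ir, st2) = go r st1
        in emit t (DBranch il ir) st2
    emit t d (seen, defs) =
        let i = M.size seen
        in (i, (M.insert t i seen, d : defs))

-- Rebuilding ties the knot lazily, so shared definitions become
-- shared heap nodes again (maximal sharing, as Jules notes).
rebuild :: ([Def], Int) -> Tree
rebuild (defs, root) = trees !! root
  where
    trees = [ case d of
                DLeaf n     -> Leaf n
                DBranch l r -> Branch (trees !! l) (trees !! r)
            | d <- defs ]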

Thomas Conway wrote:
At least one Prolog implementation (I forget which, I'm sorry), had a [de]serialisation library which used a hash-consing approach. Basically, it did its serialization using a post-order traversal and emitted references to previous values when the same value had already been emitted. Not rocket science. Actually, I've heard a Prolog guy - Bart Demoen - talk about doing pretty much this during GC to improve sharing.
Not rocket science at all, but relatively expensive. A time/space tradeoff. And these days, with memory and disks feeling cheap, most people want to trade time for space, not the other way around. Not everyone, of course.

Jules

On 2007-10-04, Jules Bean
Not rocket science at all, but relatively expensive. A time/space tradeoff. And these days, with memory and disks feeling cheap, most people want to trade time for space, not the other way around.
Caches are still of limited size, and that can make a huge difference for time.

--
Aaron Denney
-><-
participants (10)

- Aaron Denney
- Anatoly Yakovenko
- Dan Weston
- Don Stewart
- Felipe Almeida Lessa
- Jules Bean
- Roel van Dijk
- Spencer Janssen
- Stefan O'Rear
- Thomas Conway