
I am getting some weird memory usage out of this program:

module Main where

import Data.Binary
import Data.List (foldl')

main = do
  let sum' = foldl' (+) 0
  let list :: [Int] = decode $ encode $ ([1..] :: [Int])
  print $ sum' list
  print "done"

It goes up to 500M and down to 17M on Windows. It's built with GHC 6.6.1 and the latest Data.Binary. Any ideas what could be causing the memory usage to jump around so much?

Thanks, Anatoly

Does it terminate? Looks like you are summing all the natural numbers. On a Turing machine it should run forever; on a real computer it should run out of memory. Unless I am missing something obvious :-)

On Wed, Oct 03, 2007 at 01:22:25AM +0200, Roel van Dijk wrote:
Does it terminate?
Looks like you are summing all the natural numbers. On a Turing machine it should run forever; on a real computer it should run out of memory. Unless I am missing something obvious :-)
There are only about 4 billion distinct values of type Int, 2 billion of which are positive. Integer is required for bigger values.

Stefan
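A small check of that wraparound, for concreteness (this snippet is mine, not from the thread; it assumes the 32-bit Int of the poster's GHC 6.6.1/Windows setup, where GHC's Int arithmetic wraps in two's complement):

module IntWrap where

-- Int overflow wraps silently, so a running sum of [1..]
-- stays within machine-word range instead of growing.
main :: IO ()
main = do
  print (maxBound :: Int)          -- 2147483647 on a 32-bit build
  print ((maxBound :: Int) + 1)    -- wraps around to -2147483648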

On Tue, Oct 02, 2007 at 04:08:01PM -0700, Anatoly Yakovenko wrote:
I am getting some weird memory usage out of this program:
module Main where
import Data.Binary
import Data.List (foldl')

main = do
  let sum' = foldl' (+) 0
  let list :: [Int] = decode $ encode $ ([1..] :: [Int])
  print $ sum' list
  print "done"
it goes up to 500M and down to 17M on Windows. It's built with GHC 6.6.1 and the latest Data.Binary
Any ideas what could be causing the memory usage to jump around so much?
Only 500M? encode for lists is strict; I would have expected around 80GB of usage... What does -ddump-simpl-stats say?

Stefan

Program1:
module Main where
import Data.Binary
import Data.List(foldl')
main = do
  let sum' = foldl' (+) 0
  let list :: [Int] = decode $ encode $ ([1..] :: [Int])
  print $ sum' list
  print "done"
vs
Program2:
module Main where
import Data.Binary
import Data.List(foldl')
main = do
  let sum' = foldl' (+) 0
  let list :: [Int] = [1..]
  print $ sum' list
  print "done"
Neither program is expected to terminate. The point of these examples is to demonstrate that Data.Binary's encode and decode have some strange memory allocation patterns.
If you run Program1, it will run forever, but its memory usage on my machine goes up to 500M, then back down to 17M, then back up to 500M, then back down to 17M... repeatedly. I don't think this has anything to do with running out of space in a 32-bit integer.
Program2, on the other hand, runs in constant memory at around 2M.
Anatoly

aeyakovenko:
neither program is expected to terminate. The point of these examples is to demonstrate that Data.Binary encode and decode have some strange memory allocation patterns.
The encode instance for lists is fairly strict:

instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get   = do n <- get :: Get Int
               replicateM n get

This is ok, since typically you aren't serialising infinite structures. Use a newtype, and a lazier instance, if you need to do this.

-- Don
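For reference, a minimal sketch of the kind of lazier newtype instance Don suggests (the LazyList name and tag values are my own, not from the library; it uses the same element-tagging trick as Anatoly's Foo instance later in the thread):

module LazyListSketch where

import Data.Binary
import Data.Word (Word8)

newtype LazyList a = LazyList [a]

-- Tag each cons cell with 1 and the end of the list with 0, so
-- encoding never needs the length (and hence the whole list) up front.
instance Binary a => Binary (LazyList a) where
    put (LazyList [])     = put (0 :: Word8)
    put (LazyList (x:xs)) = put (1 :: Word8) >> put x >> put (LazyList xs)
    get = do
        tag <- get :: Get Word8
        case tag of
            0 -> return (LazyList [])
            _ -> do x <- get
                    LazyList xs <- get
                    return (LazyList (x : xs))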

On 10/2/07, Don Stewart
The encode instance for lists is fairly strict:
instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get   = do n <- get :: Get Int
               replicateM n get
This is ok, since typically you aren't serialising infinite structures.
Hmm, this doesn't make sense to me. The usage goes up to 500M, then down, then back up, then back down, so it isn't simply running out of memory because (length l) forces evaluation of the entire list.
Use a newtype, and a lazier instance, if you need to do this.
Thanks for the tip. This runs at a constant 4M:

module Main where

import Data.Binary
import Data.List (foldl')
import Data.Word (Word8)

data Foo = Foo Int Foo | Null

instance Binary Foo where
    put (Foo i f) = do
        put (0 :: Word8)
        put i
        put f
    put Null = put (1 :: Word8)
    get = do
        t <- get :: Get Word8
        case t of
            0 -> do
                i <- get
                f <- get
                return (Foo i f)
            1 -> return Null

sumFoo zz Null        = zz
sumFoo zz (Foo ii ff) = sumFoo (zz + ii) ff

fooBar i = Foo i (fooBar (i + 1))

main = do
    print $ sumFoo 0 $ decode $ encode $ fooBar 1
    print "done"

On 10/2/07, Don Stewart
This is ok, since typically you aren't serialising infinite structures.
Use a newtype, and a lazier instance, if you need to do this.
Maybe something like this (WARNING: ugly code, as in "not elegant", follows):

import Control.Monad (replicateM, when)
import Data.Binary
import Data.Word (Word8)

newtype List a = List [a]

split255 :: [a] -> (Word8, [a], [a])
split255 = s 0
  where
    s 255 xs   = (255, [], xs)
    s n (x:xs) = let (n', ys, zs) = s (n+1) xs in (n', x:ys, zs)
    s n []     = (n, [], [])

instance Binary a => Binary (List a) where
    put (List l) =
        let (n, xs, ys) = split255 l
        in do putWord8 n
              mapM_ put xs
              when (n == 255) (put $ List ys)
    get = do
        n  <- getWord8
        xs <- replicateM (fromEnum n) get
        if n == 255
            then get >>= \(List ys) -> return (List $ xs ++ ys)
            else return (List xs)

It uses chunks of 255 elements and so doesn't traverse the whole list before starting to output something. OTOH, there's some data overhead (1 byte every 255 elements). Seems to run your example fine and in constant memory.

HTH,

-- Felipe.
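A driver for the original test might look like this (my own hypothetical snippet, assuming Felipe's List instance above is in scope; like Program1 it runs forever, but in constant memory):

import Data.Binary
import Data.List (foldl')

main :: IO ()
main = do
    -- Wrap the infinite list in the newtype on both sides of the round trip.
    let List list = decode (encode (List [1..])) :: List Int
    print (foldl' (+) 0 list)
    print "done"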

Maybe what you are observing is that the operational semantics of undefined is undefined. The program can halt, run forever, use no memory, use all the memory. Although I doubt what GHC does with this code is a random process, I don't think it's too meaningful to ask about the space usage patterns of a program returning bottom.

Anatoly Yakovenko wrote:
neither program is expected to terminate. The point of these examples is to demonstrate that Data.Binary encode and decode have some strange memory allocation patterns.
If you run Program1, it will run forever, but its memory usage on my machine goes to 500M then back down to 17M then back up to 500M then back down to 17M... repeatedly. I don't think this has anything to do with running out of space in a 32 bit integer.
Program2 on the other hand runs at constant memory at around 2M.
Anatoly

Servers never terminate. Pretend that I have a server that reads a list encoded with Data.Binary from a socket, sums it up, and returns the current sum. I would expect it to run in constant memory, never terminate, and do useful work.
Which is basically the problem that I am facing right now: my program seems to grow randomly in memory use when marshaling large data types encoded using Data.Binary.
On 10/2/07, Dan Weston
Maybe what you are observing is that the operational semantics of undefined is undefined. The program can halt, run forever, use no memory, use all the memory.
Although I doubt what GHC does with this code is a random process, I don't think it's too meaningful to ask about the space usage patterns of a program returning bottom.

aeyakovenko:
servers never terminate. Pretend that I have a server that reads a list encoded with Data.Binary from a socket, sums it up, and returns the current sum. I would expect it to run in constant memory, never terminate, and do useful work.
which is basically the problem that I am facing right now: my program seems to grow randomly in memory use when marshaling large data types encoded using Data.Binary.
If it's specifically the list instance, where we currently trade laziness for efficiency of encoding (which may or may not be the right thing), I'd suggest a fully lazy encoding instance?

-- Don

If it's specifically the list instance, where we currently trade laziness for efficiency of encoding (which may or may not be the right thing), I'd suggest a fully lazy encoding instance?
It's not really a list, it's more of a tree that has shared nodes, something like this:

    A
   / \
  B   C
   \ /
    D
   / \
  E   F

I suspect that maybe after encode/decode I end up with something like

     A
    / \
   B   C
  /     \
 D       D
/ \     / \
E   F  E   F

On Tuesday 02 October 2007 19:51:47 Anatoly Yakovenko wrote:
I suspect that maybe after encode/decode I end up with something like

     A
    / \
   B   C
  /     \
 D       D
/ \     / \
E   F  E   F
That is correct, binary doesn't attempt to share substructures. If you'd like to do this, you'll need to do it by hand.

Cheers,
Spencer Janssen

Spencer Janssen wrote:
That is correct, binary doesn't attempt to share substructures. If you'd like to do this, you'll need to do it by hand.
...and indeed it can't be done, except by the naive brute-force method of comparing every subtree, possibly optimised by cryptographically hashing a representation of every subtree, since sharing isn't an observable property. Of course, hashing doesn't actually observe the real sharing present; rather, it computes maximal sharing. There are some applications where this could be a worthwhile win.

Jules

PS: Well, except unsafePtrEquality, but I don't really want to go there...

...and indeed it can't be done, except by the naive brute-force method of comparing every subtree, possibly optimized by cryptographically hashing a representation of every subtree, since sharing isn't an observable property.
I was thinking that instead of holding a reference to a node, each node would just hold an index into an array of nodes. Traversal would take an extra step, but it should fix the problem with encode/decode.
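A minimal sketch of that representation (the Graph/Node names and the Map-based node table are my own illustration, not code from the thread): nodes refer to their children by index, so serialising the table writes each shared node exactly once and the sharing survives the round trip.

import Data.Binary
import qualified Data.Map as M

type NodeId = Int

-- A node stores the indices of its children instead of direct
-- references, making the sharing explicit in the data itself.
data Node = Node Int [NodeId]

instance Binary Node where
    put (Node v cs) = put v >> put cs
    get = do v  <- get
             cs <- get
             return (Node v cs)

-- The whole structure is a root index plus the node table.
data Graph = Graph NodeId (M.Map NodeId Node)

instance Binary Graph where
    put (Graph r ns) = put r >> put (M.toList ns)
    get = do r  <- get
             ns <- get
             return (Graph r (M.fromList ns))

-- The shared tree from the example: B and C both point at index 3,
-- so D (and its children E, F) is stored exactly once, and
-- decode (encode shared) :: Graph keeps that single copy.
shared :: Graph
shared = Graph 0 (M.fromList
    [ (0, Node 1 [1, 2])  -- A
    , (1, Node 2 [3])     -- B
    , (2, Node 3 [3])     -- C, sharing D with B
    , (3, Node 4 [4, 5])  -- D
    , (4, Node 5 [])      -- E
    , (5, Node 6 [])      -- F
    ])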

On 10/4/07, Jules Bean
...and indeed it can't be done, except by the naive brute-force method of comparing every subtree, possibly optimised by cryptographically hashing a representation of every subtree, since sharing isn't an observable property.
At least one Prolog implementation (I forget which, I'm sorry) had a [de]serialisation library which used a hash-consing approach. Basically, it did its serialisation using a post-order traversal and emitted references to previous values when the same value had already been emitted. Not rocket science. Actually, I've heard a Prolog guy - Bart Demoen - talk about doing pretty much this during GC to improve sharing.

cheers,
T.

--
Thomas Conway
drtomc@gmail.com

Silence is the perfectest herald of joy: I were but little happy, if I could say how much.
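A rough sketch of that post-order, emit-references approach (entirely my own illustration; a real implementation would hash subtrees rather than lean on a derived Ord, as Jules suggested above):

import qualified Data.Map as M

data Tree = Leaf Int | Branch Tree Tree
    deriving (Eq, Ord)

-- Serialised form: a table of definitions whose children are
-- indices of earlier entries, plus the index of the root.
data Def = DLeaf Int | DBranch Int Int

-- Post-order traversal with a memo table: a subtree that was
-- already emitted is replaced by a reference to its index.
flatten :: Tree -> ([Def], Int)
flatten t0 = let (root, (_, defs)) = go t0 (M.empty, [])
             in (reverse defs, root)
  where
    go t st@(seen, _)
        | Just i <- M.lookup t seen = (i, st)
    go t@(Leaf n) st = emit t (DLeaf n) st
    go t@(Branch l r) st0 =
        let (il, st1) = go l st0
            (ir, st2) = go r st1
        in emit t (DBranch il ir) st2
    emit t d (seen, defs) =
        let i = M.size seen
        in (i, (M.insert t i seen, d : defs))

-- Rebuilding ties the knot lazily, so shared definitions become
-- shared heap nodes again (maximal sharing, as Jules notes).
rebuild :: ([Def], Int) -> Tree
rebuild (defs, root) = trees !! root
  where
    trees = [ case d of
                DLeaf n     -> Leaf n
                DBranch l r -> Branch (trees !! l) (trees !! r)
            | d <- defs ]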

Thomas Conway wrote:
At least one Prolog implementation (I forget which, I'm sorry), had a [de]serialisation library which used a hash-consing approach. Basically, it did its serialization using a post-order traversal and emitted references to previous values when the same value had already been emitted. Not rocket science. Actually, I've heard a Prolog guy - Bart Demoen - talk about doing pretty much this during GC to improve sharing.
Not rocket science at all, but relatively expensive. A time/space tradeoff. And these days, with memory and disks feeling cheap, most people want to trade time for space, not the other way around. Not everyone, of course.

Jules

On 2007-10-04, Jules Bean
Not rocket science at all, but relatively expensive. A time/space tradeoff. And these days, with memory and disks feeling cheap, most people want to trade time for space, not the other way around.
Caches are still of limited size, and that can make a huge difference for time.

--
Aaron Denney
-><-
participants (10)

- Aaron Denney
- Anatoly Yakovenko
- Dan Weston
- Don Stewart
- Felipe Almeida Lessa
- Jules Bean
- Roel van Dijk
- Spencer Janssen
- Stefan O'Rear
- Thomas Conway