Disappointing thought on Haskell -- a simple space leak on "insert" is hard bug to catch

Here's a problem I just found with Haskell (and maybe functional programming) that I would like to know how to avoid. Please don't get me wrong. I'm in *LUUV* with Haskell! :-) Exercise 2.3 of Chris Okasaki's book "Purely Functional Data Structures" shows that something as simple as an insert function in a binary tree can cause a space leak in Haskell. (and other functional languages as well) Here's a simple tree type: data Tree a = Empty | MkTree (Tree a) a (Tree a) Here's the version of insert with the space leak: -- ** SPACE LEAK! ** version -- this version unnecessarily copies the path to the inserted element even if -- the same element is already present in the tree. -- a million insertions of the same element will cause a million trees -- to exist whereone would suffice. treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e Empty = MkTree Empty e Empty treeInsert e t@(MkTree a x b) | e > x = MkTree a x (treeInsert e b) | e < x = MkTree (treeInsert e a) x b | otherwise = t Here's a version without the space leak: treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e t = case (treeInsert' e t) of Nothing -> t Just x -> x where treeInsert' :: Ord a => a -> Tree a -> Maybe (Tree a) treeInsert' e Empty = return $ MkTree Empty e Empty treeInsert' e (MkTree a x b) | e > x = do {r <- treeInsert' e b; return $ MkTree a x r} | e < x = do {r <- treeInsert' e a; return $ MkTree r x b} | otherwise = Nothing This is so tricky that I think it is worth questioning whether Haskell is helping here. First, just spotting that the first version leads to a space leak is not trivial. Second, the fix is so much uglier than the original version. It is disappointing to have to write it! Questions: 1) ** Is there a warning sign that would easily flag this space leak? ** 2) Is there a simpler fix? Cheers, Dimitri

Sometimes one wants duplicate keys.
Perhaps findkey might be better.
--
--
Sent from an expensive device which will be obsolete in a few months! :D
Casey
On Aug 15, 2014 5:15 PM, "Dimitri DeFigueiredo"
Here's a problem I just found with Haskell (and maybe functional programming) that I would like to know how to avoid.
Please don't get me wrong. I'm in *LUUV* with Haskell! :-)
Exercise 2.3 of Chris Okasaki's book "Purely Functional Data Structures" shows that something as simple as an insert function in a binary tree can cause a space leak in Haskell. (and other functional languages as well)
Here's a simple tree type:
data Tree a = Empty | MkTree (Tree a) a (Tree a)
Here's the version of insert with the space leak:
-- ** SPACE LEAK! ** version -- this version unnecessarily copies the path to the inserted element even if -- the same element is already present in the tree. -- a million insertions of the same element will cause a million trees -- to exist whereone would suffice.
treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e Empty = MkTree Empty e Empty treeInsert e t@(MkTree a x b) | e > x = MkTree a x (treeInsert e b) | e < x = MkTree (treeInsert e a) x b | otherwise = t
Here's a version without the space leak:
treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e t = case (treeInsert' e t) of Nothing -> t Just x -> x where treeInsert' :: Ord a => a -> Tree a -> Maybe (Tree a) treeInsert' e Empty = return $ MkTree Empty e Empty treeInsert' e (MkTree a x b) | e > x = do {r <- treeInsert' e b; return $ MkTree a x r} | e < x = do {r <- treeInsert' e a; return $ MkTree r x b} | otherwise = Nothing
This is so tricky that I think it is worth questioning whether Haskell is helping here. First, just spotting that the first version leads to a space leak is not trivial. Second, the fix is so much uglier than the original version. It is disappointing to have to write it!
Questions: 1) ** Is there a warning sign that would easily flag this space leak? ** 2) Is there a simpler fix?
Cheers,
Dimitri
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

On 08/16/2014 01:15 AM, Dimitri DeFigueiredo wrote:
Here's a problem I just found with Haskell (and maybe functional programming) that I would like to know how to avoid.
Please don't get me wrong. I'm in *LUUV* with Haskell! :-)
Exercise 2.3 of Chris Okasaki's book "Purely Functional Data Structures" shows that something as simple as an insert function in a binary tree can cause a space leak in Haskell. (and other functional languages as well)
Here's a simple tree type:
data Tree a = Empty | MkTree (Tree a) a (Tree a)
Here's the version of insert with the space leak:
-- ** SPACE LEAK! ** version -- this version unnecessarily copies the path to the inserted element even if -- the same element is already present in the tree. -- a million insertions of the same element will cause a million trees -- to exist whereone would suffice.
treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e Empty = MkTree Empty e Empty treeInsert e t@(MkTree a x b) | e > x = MkTree a x (treeInsert e b) | e < x = MkTree (treeInsert e a) x b | otherwise = t
Here's a version without the space leak:
treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e t = case (treeInsert' e t) of Nothing -> t Just x -> x where treeInsert' :: Ord a => a -> Tree a -> Maybe (Tree a) treeInsert' e Empty = return $ MkTree Empty e Empty treeInsert' e (MkTree a x b) | e > x = do {r <- treeInsert' e b; return $ MkTree a x r} | e < x = do {r <- treeInsert' e a; return $ MkTree r x b} | otherwise = Nothing
This is so tricky that I think it is worth questioning whether Haskell is helping here. First, just spotting that the first version leads to a space leak is not trivial. Second, the fix is so much uglier than the original version. It is disappointing to have to write it!
Questions: 1) ** Is there a warning sign that would easily flag this space leak? **
Benchmarking and profiling.
2) Is there a simpler fix?
I don't know.
Cheers,
Dimitri
-- Mateusz K.

Dimitri DeFigueiredo wrote:
Here's a problem I just found with Haskell (and maybe functional programming) that I would like to know how to avoid.
Please don't get me wrong. I'm in *LUUV* with Haskell! :-)
Exercise 2.3 of Chris Okasaki's book "Purely Functional Data Structures" shows that something as simple as an insert function in a binary tree can cause a space leak in Haskell. (and other functional languages as well)
Here's a simple tree type:
data Tree a = Empty | MkTree (Tree a) a (Tree a)
Here's the version of insert with the space leak:
-- ** SPACE LEAK! ** version -- this version unnecessarily copies the path to the inserted element even if -- the same element is already present in the tree. -- a million insertions of the same element will cause a million trees -- to exist whereone would suffice.
treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e Empty = MkTree Empty e Empty treeInsert e t@(MkTree a x b) | e > x = MkTree a x (treeInsert e b) | e < x = MkTree (treeInsert e a) x b | otherwise = t
Here's a version without the space leak:
treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e t = case (treeInsert' e t) of Nothing -> t Just x -> x where treeInsert' :: Ord a => a -> Tree a -> Maybe (Tree a) treeInsert' e Empty = return $ MkTree Empty e Empty treeInsert' e (MkTree a x b) | e > x = do {r <- treeInsert' e b; return $ MkTree a x r} | e < x = do {r <- treeInsert' e a; return $ MkTree r x b} | otherwise = Nothing
This is so tricky that I think it is worth questioning whether Haskell is helping here. First, just spotting that the first version leads to a space leak is not trivial. Second, the fix is so much uglier than the original version. It is disappointing to have to write it!
Questions: 1) ** Is there a warning sign that would easily flag this space leak? ** 2) Is there a simpler fix?
The really tricky thing here is that the space leak is not what you think it is. :) When you use the tree in an ephemeral fashion, the issue is *not* that the path to the element is being copied, because the old path is usually not referenced anywhere, and can be garbage collected immediately. Asymptotically, this has no impact on space usage. Instead, the issue is that the tree operations are not "completed". Inserting the same element a thousand times without inspecting the tree afterwards, you will get a tree that contains a huge unevaluated subtree, like MkTree (treeInstert 'a' (treeInsert 'a' ...)) 'f' Empty What your supposed fix actually does is that it fully evaluates the subtrees as well. An easier way to achieve this to use a "strictness annotation" in the data type, which is done via the `!` symbol: data Tree a = Empty | MkTree !(Tree a) a !(Tree a) This data type may not have any unevaluated subtrees. Try it and see whether it solves your problem. In general, lazy evaluation is a trade-off: It makes some programs simpler to write, but the price is that the cost model becomes more difficult, both for time and for space usage. Cost models that you may know from eager languages do not apply anymore. For reasoning about time usage, Okasaki introduces the "debit method",w hich you've certainly read about. Another exposition can be found in [1]. For reasoning about space, I recommend an approach that I like to call "space invariants" [2]. The key problem with the space leak above is that the semantics of a tree -- which elements it contains -- no longer coincides with the size of its representation -- an expression which may be only partially evaluated. For things like infinite lists, e.g. [1..], this is actually great! But for other things like finite maps, this may lead to unexpected space usage. The remedy is to enforce a "space invariant", which is a guarantee that links the semantics and the size of the representation. An example would be "A tree in WHNF uses only as much space as the leaves it contains (+ skeleton)". The annotion `!` is one way to enforce these invariants. [1]: http://apfelmus.nfshost.com/articles/debit-method.html [2]: http://apfelmus.nfshost.com/blog/2013/08/21-space-invariants.html Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Hi Heinrich, Thanks so much! :-) Wow! I was totally wrong on that one then. I got it that lazy evaluation is the culprit here. Now, I am still a little uneasy about insert producing a new version of the tree when it is not needed. Can't that come back to bite me later depending on how I use it? In your explanation, you say that "the old path is usually not referenced anywhere, and can be garbage collected immediately." I am a little uneasy about the "usually". Could you expand on that? What if it is? Could that happen? Or does that just imply a bigger constant? Thanks again, Dimitri Em 16/08/14 04:13, Heinrich Apfelmus escreveu:
Dimitri DeFigueiredo wrote:
Here's a problem I just found with Haskell (and maybe functional programming) that I would like to know how to avoid.
Please don't get me wrong. I'm in *LUUV* with Haskell! :-)
Exercise 2.3 of Chris Okasaki's book "Purely Functional Data Structures" shows that something as simple as an insert function in a binary tree can cause a space leak in Haskell. (and other functional languages as well)
Here's a simple tree type:
data Tree a = Empty | MkTree (Tree a) a (Tree a)
Here's the version of insert with the space leak:
-- ** SPACE LEAK! ** version -- this version unnecessarily copies the path to the inserted element even if -- the same element is already present in the tree. -- a million insertions of the same element will cause a million trees -- to exist whereone would suffice.
treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e Empty = MkTree Empty e Empty treeInsert e t@(MkTree a x b) | e > x = MkTree a x (treeInsert e b) | e < x = MkTree (treeInsert e a) x b | otherwise = t
Here's a version without the space leak:
treeInsert :: Ord a => a -> Tree a -> Tree a treeInsert e t = case (treeInsert' e t) of Nothing -> t Just x -> x where treeInsert' :: Ord a => a -> Tree a -> Maybe (Tree a) treeInsert' e Empty = return $ MkTree Empty e Empty treeInsert' e (MkTree a x b) | e > x = do {r <- treeInsert' e b; return $ MkTree a x r} | e < x = do {r <- treeInsert' e a; return $ MkTree r x b} | otherwise = Nothing
This is so tricky that I think it is worth questioning whether Haskell is helping here. First, just spotting that the first version leads to a space leak is not trivial. Second, the fix is so much uglier than the original version. It is disappointing to have to write it!
Questions: 1) ** Is there a warning sign that would easily flag this space leak? ** 2) Is there a simpler fix?
The really tricky thing here is that the space leak is not what you think it is. :)
When you use the tree in an ephemeral fashion, the issue is *not* that the path to the element is being copied, because the old path is usually not referenced anywhere, and can be garbage collected immediately. Asymptotically, this has no impact on space usage.
Instead, the issue is that the tree operations are not "completed". Inserting the same element a thousand times without inspecting the tree afterwards, you will get a tree that contains a huge unevaluated subtree, like
MkTree (treeInstert 'a' (treeInsert 'a' ...)) 'f' Empty
What your supposed fix actually does is that it fully evaluates the subtrees as well.
An easier way to achieve this to use a "strictness annotation" in the data type, which is done via the `!` symbol:
data Tree a = Empty | MkTree !(Tree a) a !(Tree a)
This data type may not have any unevaluated subtrees. Try it and see whether it solves your problem.
In general, lazy evaluation is a trade-off: It makes some programs simpler to write, but the price is that the cost model becomes more difficult, both for time and for space usage. Cost models that you may know from eager languages do not apply anymore.
For reasoning about time usage, Okasaki introduces the "debit method",w hich you've certainly read about. Another exposition can be found in [1].
For reasoning about space, I recommend an approach that I like to call "space invariants" [2]. The key problem with the space leak above is that the semantics of a tree -- which elements it contains -- no longer coincides with the size of its representation -- an expression which may be only partially evaluated. For things like infinite lists, e.g. [1..], this is actually great! But for other things like finite maps, this may lead to unexpected space usage. The remedy is to enforce a "space invariant", which is a guarantee that links the semantics and the size of the representation. An example would be "A tree in WHNF uses only as much space as the leaves it contains (+ skeleton)". The annotion `!` is one way to enforce these invariants.
[1]: http://apfelmus.nfshost.com/articles/debit-method.html [2]: http://apfelmus.nfshost.com/blog/2013/08/21-space-invariants.html
Best regards, Heinrich Apfelmus
-- http://apfelmus.nfshost.com
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

Dimitri DeFigueiredo wrote:
Hi Heinrich,
Thanks so much! :-)
Wow! I was totally wrong on that one then. I got it that lazy evaluation is the culprit here. Now, I am still a little uneasy about insert producing a new version of the tree when it is not needed. Can't that come back to bite me later depending on how I use it?
In your explanation, you say that "the old path is usually not referenced anywhere, and can be garbage collected immediately." I am a little uneasy about the "usually". Could you expand on that? What if it is? Could that happen? Or does that just imply a bigger constant?
First, note that a new version of the tree has to be created whenever the element you insert was not in the tree. Most likely, your "higher-up" algorithm that makes use of the tree structure doesn't treat these two cases too differently, so you probably have to think about the case where the tree is copied anyway. Second, you may want to look up on how lazy evaluation, respectively *graph reduction* work. This will answer your questions. For instance, the new code that you wrote implicitly builds a new version of the tree as well (building a tower of expressions of the form `>>= \r -> return $ MkTree a x r`) but discards it level by level when it turns out that the end result is `Nothing`. Without an understanding of the evaluation model, you will always feel uneasy. The basic tenet of garbage collection is that the binding for any name which does not occur in the current expression can be discard. For instance, in the expression let x = "A String which uses a lot of memory" y = 12 in y + 2 we can safely discard the name `x` and its value, as the name `x` does not occur in the expression `y + 2`. But in the expression, let x = "A String which uses a lot of memory" y = 12 in snd (x,y+2) we may not discard `x` just yet, because it still occurs in the expression `snd (x,y+2)`. Of course, performing reduction step will turn this into `y+2` and now the memory used by binding for `x` can be freed. Note that the runtime may choose to wait some time before freeing the associated memory. In fact, at periodic intervals, the runtime will stop the current evaluation and instead perform a process called "garbage collection", where it looks for all variable names and frees the memory of those that are no longer needed. A "space leak" generally refers to the situation where a name is no longer needed but still occurs in the expression to be evaluated. This is very much like the `x` in the second example: it's "not really needed", because we humans know that `snd` doesn't use it, but it's still written there, so the runtime can't throw it away. Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Hi: Heinrich Apfelmus wrote: "But in the expression, let x = "A String which uses a lot of memory" y = 12 in snd (x,y+2) we may not discard `x` just yet, because it still occurs in the expression `snd (x,y+2)`. Of course, performing reduction step will turn this into `y+2` and now the memory used by binding for `x` can be freed." Wouldn't the compiler know how "snd" works and therefore know that the first argument isn't needed? Isn't the above the advantage of pure code - same inputs ~ same outputs? On Tue, Aug 19, 2014 at 7:42 AM, Heinrich Apfelmus < apfelmus@quantentunnel.de> wrote:
Dimitri DeFigueiredo wrote:
Hi Heinrich,
Thanks so much! :-)
Wow! I was totally wrong on that one then. I got it that lazy evaluation is the culprit here. Now, I am still a little uneasy about insert producing a new version of the tree when it is not needed. Can't that come back to bite me later depending on how I use it?
In your explanation, you say that "the old path is usually not referenced anywhere, and can be garbage collected immediately." I am a little uneasy about the "usually". Could you expand on that? What if it is? Could that happen? Or does that just imply a bigger constant?
First, note that a new version of the tree has to be created whenever the element you insert was not in the tree. Most likely, your "higher-up" algorithm that makes use of the tree structure doesn't treat these two cases too differently, so you probably have to think about the case where the tree is copied anyway.
Second, you may want to look up on how lazy evaluation, respectively *graph reduction* work. This will answer your questions. For instance, the new code that you wrote implicitly builds a new version of the tree as well (building a tower of expressions of the form `>>= \r -> return $ MkTree a x r`) but discards it level by level when it turns out that the end result is `Nothing`. Without an understanding of the evaluation model, you will always feel uneasy.
The basic tenet of garbage collection is that the binding for any name which does not occur in the current expression can be discard. For instance, in the expression
let x = "A String which uses a lot of memory" y = 12 in y + 2
we can safely discard the name `x` and its value, as the name `x` does not occur in the expression `y + 2`.
But in the expression,
let x = "A String which uses a lot of memory" y = 12 in snd (x,y+2)
we may not discard `x` just yet, because it still occurs in the expression `snd (x,y+2)`. Of course, performing reduction step will turn this into `y+2` and now the memory used by binding for `x` can be freed.
Note that the runtime may choose to wait some time before freeing the associated memory. In fact, at periodic intervals, the runtime will stop the current evaluation and instead perform a process called "garbage collection", where it looks for all variable names and frees the memory of those that are no longer needed.
A "space leak" generally refers to the situation where a name is no longer needed but still occurs in the expression to be evaluated. This is very much like the `x` in the second example: it's "not really needed", because we humans know that `snd` doesn't use it, but it's still written there, so the runtime can't throw it away.
Best regards, Heinrich Apfelmus
-- http://apfelmus.nfshost.com
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners
-- -- Sent from an expensive device which will be obsolete in a few months! :D Casey

On Tue, Aug 19, 2014 at 11:34 AM, KC
Wouldn't the compiler know how "snd" works and therefore know that the first argument isn't needed?
No, snd is not a compiler built-in. -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

KC wrote:
Heinrich Apfelmus wrote:
"But in the expression,
let x = "A String which uses a lot of memory" y = 12 in snd (x,y+2)
we may not discard `x` just yet, because it still occurs in the expression `snd (x,y+2)`. Of course, performing reduction step will turn this into `y+2` and now the memory used by binding for `x` can be freed."
Wouldn't the compiler know how "snd" works and therefore know that the first argument isn't needed?
Isn't the above the advantage of pure code - same inputs ~ same outputs?
What does it mean for a compiler to know something? Evaluation of a Haskell program follows a deterministic process. In the case of lazy evaluation, it is: Repeatedly find the outermost leftmost redex and reduce it, occasionally do a garbage collection to remove unused bindings (graphs). That's all there is to it. In this context, the example I gave above retains the binding `x` slightly longer than strictly necessary from a semantic point of view. Of course, before executing a Haskell program, the compiler can try to analyze it and find transformations that may make its subsequent execution more efficient. For instance, the fact that `snd` does not depend on the first component of the pair can be caught by strictness analysis, and the compiler might choose to inline it. However, once the compiler is done with the optimization phase, it will produce a program that follows the lazy evaluation strategy to its bitter -- or joyous -- end, space leaks and all. Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com
participants (5)
-
Brandon Allbery
-
Dimitri DeFigueiredo
-
Heinrich Apfelmus
-
KC
-
Mateusz Kowalczyk