Proposal: add laziness to Data.Map / IntMap

Hi,

I found myself wanting a lazy version of Data.Map today, and by "lazy" I mean in the node-subtree pointers. I trust the library writers when they put strictness annotations in the fields of the tree nodes, so I'm wondering what the various implications of lazy subtrees are, beyond the obvious speed hit. Does this cause space leaks beyond what lazy value pointers can already cause? Can someone point me to some reading that discusses this?

Anyway, I'm posting to libraries (rather than haskell-cafe) to gauge opinion about a possible rewrite of Data.Map and IntMap to remove strictness annotations (bangs) from the node constructors and move them into the functions (as seqs). "Rewrite" is maybe too strong a word; "significant patch" is more like it. It would involve only those functions that construct Bin values directly, which is not that many. Less than a day's work, I think (yes, that means I volunteer). The semantics of everything remains unchanged, but it opens up the possibility for lazy versions of some functions. The most useful result of this would be a lazy map (little m).

Here's Data.Map.mapWithKey:

    mapWithKey f Tip = Tip
    mapWithKey f (Bin sx kx x l r)
        = Bin sx kx (f kx x) (mapWithKey f l) (mapWithKey f r)

De-banged and then restrictified, it would look like this:

    mapWithKey f Tip = Tip
    mapWithKey f (Bin sx kx x l r)
        = seq l' $ seq r' $ Bin sx kx (f kx x) l' r'
        where l' = mapWithKey f l
              r' = mapWithKey f r

Looking at the first version, clearly you see that when constructing a new map you should only have to pay for the subtrees that you actually use. If you perform only a handful of lookups and then throw the new tree away, why build the whole thing?

To further motivate, let me explain my use case. I have cyclical data structures (graphs, more or less) that I mutate frequently, so I store them in a Map, indexed by some Ord thing, let's say Int, so I'd have something like Map Int [Int] (but not that exactly, and nothing like Data.Graph). This is great for mutations because I can use backtracking, but for lookups it's a burden on both me and the CPU. So I memoize the thing into something like

    data Node a = Node a [Node a]

I can do this memoization using Data.Map.mapWithKey, with the Nodes built therein referring back to the produced Map. But then, what if I only crawl a small portion of this cyclical network of Nodes? Why should I have to pay for the whole thing to be rebuilt? It defeats the purpose of the memoization, which is to amortize the cost of following edges in the mutable graph.

The pros and cons as I see them are:

Pro:
- More flexible data structure

Con:
- Code is more verbose (see Data.Tree.AVL)
- Only a few (but important) functions can be made lazy

To that last point, note that while mapWithKey can be made lazy for both Map and IntMap, only IntMap allows a lazy filter and mapMaybe, because it doesn't rebalance. But I'm wondering how much of the tree needs to be forced when rebalancing. It should be only O(log n), right?

It also becomes important where the tree is sourced from. The source needs to produce the tree lazily. The regular definition of fromList (= foldr (uncurry insert) empty) admits no laziness, but maybe successive unions could, if the sub-maps were nearly disjoint (a not-uncommon case, I think). Does anyone know if any benchmarking has been done to this end?

Finally, I'll stress once more that the semantics of the functions currently exported would be unchanged. This would only add new lazy versions, named something like mapWithKeyL or unionL.

So what do you think? Too much for too little?

Scott
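
P.S. To make the use case concrete, here's a minimal sketch of the knot-tied memoization, assuming a simple adjacency-list graph (the Graph type and memoise function are illustrative names, not my actual code):

    import qualified Data.Map as Map

    -- Illustrative graph: adjacency lists keyed by Int.
    type Graph a = Map.Map Int (a, [Int])

    data Node a = Node a [Node a]

    -- Tie the knot: each Node's neighbour list indexes back into the
    -- very Map being produced.  The Node values are thunks either way;
    -- what a lazy mapWithKey would save is rebuilding the whole spine
    -- of Bin nodes up front when only a few paths are ever crawled.
    memoise :: Graph a -> Map.Map Int (Node a)
    memoise g = nodes
      where
        nodes = Map.mapWithKey build g
        build _ (x, ks) = Node x [nodes Map.! k | k <- ks]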

sedillard:
> Hi,
>
> I found myself wanting a lazy version of Data.Map today, and by "lazy" I mean in the node-subtree pointers. I trust the library writers when they put strictness annotations in the fields of the tree nodes, so I'm wondering what the various implications of lazy subtrees are, beyond the obvious speed hit. Does this cause space leaks beyond what lazy value pointers can already cause? Can someone point me to some reading that discusses this?
>
> Anyway, I'm posting to libraries (rather than haskell-cafe) to gauge opinion about a possible rewrite of Data.Map and IntMap to remove strictness annotations (bangs) from the node constructors and move them into the functions (as seqs). "Rewrite" is maybe too strong a word; "significant patch" is more like it. It would involve only those functions that construct Bin values directly, which is not that many. Less than a day's work, I think (yes, that means I volunteer). The semantics of everything remains unchanged, but it opens up the possibility for lazy versions of some functions.

How about doing it as a separate library? Then we can choose either strict or lazy as the case may be.

-- Don

I think maybe you guys (Don and Andrew) are misunderstanding my proposal. The lazy/strict tradeoff is a subtle issue, and I'll be sure to re-read Okasaki's stuff with this in mind, but what I'm talking about here is not a trade-off. It's laziness "for free": move the strictness annotations out of the constructors and into the library functions using 'seq'. Laziness is exposed through _separate_ functions.

I'll copy again my proposed versions of mapWithKey (because I made a typo the first time):
    mapWithKey :: (k -> a -> b) -> Map k a -> Map k b
    mapWithKey _ Tip = Tip
    mapWithKey f (Bin sx kx x l r)
        = seq l' $ seq r' $ Bin sx kx (f kx x) l' r'
        where l' = mapWithKey f l
              r' = mapWithKey f r

    mapWithKeyLazy :: (k -> a -> b) -> Map k a -> Map k b
    mapWithKeyLazy _ Tip = Tip
    mapWithKeyLazy f (Bin sx kx x l r)
        = Bin sx kx (f kx x) (mapWithKeyLazy f l) (mapWithKeyLazy f r)
So mapWithKey retains all semantics, including guarantees. So would insert, and all other functions. You export a second API (maybe in a nested module) that exposes laziness. Writing another library is kind of silly since (a) you want strictness 90% of the time and (b) it shares 90% of the code. If maintainers are willing to deal with some extra seqs here and there, then we can have both libraries in one.
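
Concretely, the whole change boils down to a strict smart constructor. A rough sketch, assuming a de-banged node type (the 'bin' helper is an illustrative name, not the actual Data.Map internals):

    type Size = Int

    -- De-banged: the subtree fields are now lazy (the real Data.Map
    -- puts bangs on them).
    data Map k a = Tip | Bin !Size k a (Map k a) (Map k a)

    -- Strict smart constructor: every existing function builds its
    -- nodes through this, preserving today's semantics, while the
    -- new lazy variants apply Bin directly.
    bin :: Size -> k -> a -> Map k a -> Map k a -> Map k a
    bin s k x l r = l `seq` r `seq` Bin s k x l r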
Scott

Scott Dillard wrote:
> I think maybe you guys (Don and Andrew) are misunderstanding my proposal. The lazy/strict tradeoff is a subtle issue, and I'll be sure to re-read Okasaki's stuff with this in mind, but what I'm talking about here is not a trade-off. It's laziness "for free". Move the strictness annotations out of the constructors and into the library functions using 'seq'. Laziness is exposed through _separate_ functions.

It's not for free. When the compiler does a pattern match on the Bin constructor, Bin sz kx x l r, it can no longer assume that l and r are fully evaluated, so it has to add code to evaluate them in case they are not. And in fact, this code will be needed if any of your proposed lazy functions are added later. I have not checked whether this has a measurable performance or code size impact.

> So mapWithKey retains all semantics, including guarantees.

Semantically the change is safe, agreed.

regards,
Bertram

G'day all.
Quoting Scott Dillard:
> I found myself wanting a lazy version of Data.Map today, and by "lazy" I mean in the node-subtree pointers.

Right. Just to be clear, to start off with:

- It makes no sense for keys to be lazy, because they need to be inspected to determine the shape of the data structure. (Except in the case of a singleton map: if you know where in the tree some (key,value) pair goes, then you've already evaluated the key.)
- It's better for values to be lazy in a general-purpose map-type data structure, because making them strict breaks Functor.

So the remaining interesting question is the internal structure pointers.
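
To make the Functor breakage concrete, here is a minimal sketch with a strict value field (SNode is an invented stand-in for a map node, not Data.Map's representation):

    data SNode a = SNode !a

    instance Functor SNode where
        fmap f (SNode x) = SNode (f x)

    -- The composition law, fmap (g . f) == fmap g . fmap f, fails
    -- when f's result is never demanded:
    lawLHS, lawRHS :: SNode Int
    lawLHS = fmap (const 1 . const (undefined :: ())) (SNode (0 :: Int))
    -- lawLHS is SNode 1: the undefined intermediate is discarded.
    lawRHS = (fmap (const 1) . fmap (const (undefined :: ()))) (SNode (0 :: Int))
    -- lawRHS is bottom: the strict field forces the undefined value.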
> I trust the library writers when they put strictness annotations in the fields of the tree nodes, so I'm wondering what the various implications of lazy subtrees are, beyond the obvious speed hit. Does this cause space leaks beyond what lazy value pointers can already cause? Can someone point me to some reading that discusses this?
Yes, please read Chris Okasaki's "Purely Functional Data Structures" for a fuller discussion of the tradeoffs of putting laziness in different places in a data structure.

Making internal pointers strict vs. making them lazy doesn't necessarily buy you much in the way of raw-cycle-counting performance. What it buys you is a complexity _guarantee_, which in Haskell is often more valuable. Thinking of a binary search tree for a moment, making the internal pointers lazy means that insertion is always O(1), but lookup may take an arbitrary amount of time (though it will be O(log n) amortised). It also adds a raw-cycle-counting cost to every lookup, even if the tree is fully evaluated.

This is the opposite of the usual desired performance. Dictionary implementations tend to assume that lookups are more common than insertions and deletions, and correspondingly, clients tend to assume that insertions and deletions are more expensive than lookups. If these assumptions don't match your code, then indeed, you may be using the wrong data structure.
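
As a toy illustration of that O(1)-insert behaviour, here is an unbalanced BST with lazy subtree pointers (invented names, no rebalancing, and certainly not Data.Map itself):

    data T k v = Leaf | Node k v (T k v) (T k v)

    -- Each insert does O(1) immediate work: it forces the root to
    -- WHNF and allocates one Node whose rebuilt child is a thunk.
    -- The path below is paid for by whichever lookup first walks it.
    insertLazy :: Ord k => k -> v -> T k v -> T k v
    insertLazy k v Leaf = Node k v Leaf Leaf
    insertLazy k v (Node k' v' l r) = case compare k k' of
        LT -> Node k' v' (insertLazy k v l) r
        EQ -> Node k  v  l r
        GT -> Node k' v' l (insertLazy k v r)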
> Looking at the first version, clearly you see that when constructing a new map you should only have to pay for the subtrees that you actually use. If you perform only a handful of lookups and then throw the new tree away, why build the whole thing?

If you only perform a handful of lookups, I question whether you actually wanted a binary search tree to begin with. Wouldn't an association list have done the job just as well? Or compiling to functions?

> To further motivate, let me explain my use case. [...] So I memoize the thing into something like "data Node a = Node a [Node a]"
Right. Memoising CAFs is an excellent example of one of those very few places where writing your own dictionary data type can make a lot of sense. Why? Because there are a lot of things that you don't need from, say, AVL trees. For example, you know all the keys in advance, which means that your memo table won't need to be modified once it's built. You don't even need insertions, let alone deletions or a Functor instance.

Have you tried just using functions? Something like this:

    -- WARNING: Untested code follows.  Use at own risk.
    type MyMap k v = k -> v  -- k -> Maybe v may also be appropriate.

    myMapFromSortedAssocList :: (Ord k) => [(k,v)] -> MyMap k v
    myMapFromSortedAssocList kvs = buildMap (length kvs) kvs
      where
        errorCase = error "Can't find key"

        -- Feel free to optimise additional base cases if desired.
        buildMap _ []      = \key -> errorCase
        buildMap _ [(k,v)] = \key -> if k == key then v else errorCase
        buildMap l kvs
            = let l2 = l `div` 2
                  (kvs1, (k,v):kvs2) = splitAt l2 kvs
                  mapl = buildMap l2 kvs1
                  mapr = buildMap (l - l2 - 1) kvs2
              in \key -> case compare key k of
                             LT -> mapl key
                             GT -> mapr key
                             EQ -> v

(Exercise for intermediate-level Haskellers: Why is "key" bound by an explicit lambda?)

Cheers,
Andrew Bromage
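
P.S. Usage would look something like this (the table below is invented for illustration; the input list must already be sorted by key):

    table :: MyMap Int String
    table = myMapFromSortedAssocList [(1,"one"), (2,"two"), (3,"three")]

    -- table 2 evaluates to "two"; an absent key such as table 9
    -- hits the error case.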
participants (4)
- ajb@spamcop.net
- Bertram Felgenhauer
- Don Stewart
- Scott Dillard