Re: Proposal: add laziness to Data.Map / IntMap

Hi Bertram Scott Dillard wrote:
I think maybe you guys (Don and Andrew) are misunderstanding my proposal. The lazy/strict tradeoff is a subtle issue, and I'll be sure to re-read Okasaki's stuff with this in mind, but what I'm talking about here is not a trade off. It's laziness "for free". Move the strictness annotations out of the constructors and into the library functions using 'seq'. Laziness is exposed through _separate_ functions.
It's not for free. When the compiler does a pattern match on the Bin constructor, Bin sz kx x l r, it can no longer assume that l and r are fully evaluated, so it has to add code to evaluate them in case they are not. And in fact, this code will be needed if any of your proposed lazy functions are added later. I have not checked whether this has a measurable performance or code size impact.
So mapWithKey retains all semantics, including guarantees.
Semantically the change is safe, agreed.
You're right of course (hadn't thought about that) but I'd like to point out that Adrien Hey's AVL tree library is structured in the way I propose: A node's subtree fields are not marked strict, instead seq is used to force construction of the tree. I think it's generally faster than Data.Map, at least it's claimed to be so in the documentation. The trees are not built with the same algorithm, but if the use of seq instead of bangs does impose a significant overhead, then that speaks very highly to the AVL algorithm. This of course raises the question of why I don't just use that library. I'm checking it out, but it's quite a bit bigger than Data.Map. The thing I like about Data.Map is that it's relatively simple and I can see whats going on very quickly, enough to make modifications (which I'm about to do.) Reading Data.Tree.AVL is a more daunting task. Also, there's something to be said for "lazy by default." I don't see much difference between Data.List.map and Data.Map.map (for finite lists anyway) and so if laziness is the correct default for the former then it should also be for the latter. The current Data.Map.map is only half strict: it builds an entire tree and fills it with thunks. Better I think to either build the whole thing completely or build only what's needed. The current Data.Map.map is the worst of both, no laziness + heap burn. So I guess I'll start writing. Anyone have any good benchmarks? Which naming scheme is less ugly, Data.Map.mapWithKeyL or Data.Map.Lazy.mapWithKey? Any other suggestions would be appreciated. Scott

Which naming scheme is less ugly, Data.Map.mapWithKeyL or Data.Map.Lazy.mapWithKey?
A separate module is much better, as it will allow switching entire modules over just by changing an import, and still allow something close to the mapWithKeyL usage via a qualified import (e.g. "import qualified Data.Map.Lazy as L") Ganesh ============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html ==============================================================================

Scott Dillard wrote:
You're right of course (hadn't thought about that) but I'd like to point out that Adrien Hey's AVL tree library is structured in the way I propose: A node's subtree fields are not marked strict, instead seq is used to force construction of the tree. I think it's generally faster than Data.Map, at least it's claimed to be so in the documentation. The trees are not built with the same algorithm, but if the use of seq instead of bangs does impose a significant overhead, then that speaks very highly to the AVL algorithm.
Using explicit seqs rather than strict data types is actually faster, for reasons that are a bit of a mystery to me. I'm not sure what cost Bertram is talking about, but AFAIK ghc uses the same info pointer mechanism for all heap records, including unevaluated thunks (although the info pointers will point to different things of course). But the cost of pattern matching on *evaluated* AVL nodes should be independent of strictness annotation AFAICS. Regards -- Adrian Hey

Adrian Hey wrote:
Using explicit seqs rather than strict data types is actually faster, for reasons that are a bit of a mystery to me. I'm not sure what cost Bertram is talking about, but AFAIK ghc uses the same info pointer mechanism for all heap records, including unevaluated thunks (although the info pointers will point to different things of course). But the cost of pattern matching on *evaluated* AVL nodes should be independent of strictness annotation AFAICS.
I must admit that for the time being the cost is of a theoretical nature. But let me explain the idea. Consider this code:
module Nat (isOne) where
data Nat = Succ !Nat | Zero
isOne :: Nat -> Bool isOne n = case n of Zero -> False Succ n' -> case n' of Zero -> True Succ _ -> False
The code of isOne starts by forcing n (looking at n's tag and entering the closure if it's unevaluated in ghc's case) and then a pattern match (looking at the the tag again). Now for the second pattern match, we can skip the first step, because we know that n' is fully evaluated, thanks to the strictness annotation in the Succ constructor. However, ghc currently does generate the same (cmm) code for isOne regardless of the strictness annotation, so performance wise you only get to pay the price of the annotation (I expect that some thunks are unnecessarily reevaluated when the constructor is used) without this benefit. Did I miss any reason why this idea can't work? I really expected ghc to do that optimisation - obviously that was wishful thinking on my part. Bertram

This of course raises the question of why I don't just use that library. I'm checking it out, but it's quite a bit bigger than Data.Map. The thing I like about Data.Map is that it's relatively simple and I can see whats going on very quickly, enough to make modifications (which I'm about to do.) Reading Data.Tree.AVL is a more daunting task.
I agree. Have a look at code.haskell.org/gmap/api . The Data.GMap.OrdMap module is a wrapper around AvlTree, using a slightly more complete interface than Data.Map . It should make an easier starting point. In fact, I suspect that if you just make a new AvlTreeL by deleting every occurrence of `seq` and then OrdMapL by changing the import in OrdMap everything would work seamlessly. The current version of OrdMap should be fairly safe to use. I'll be putting a proper package on hackage in a week or two once Ive tidied everything up. Jamie
participants (5)
-
Adrian Hey
-
Bertram Felgenhauer
-
Jamie Brandon
-
Scott Dillard
-
Sittampalam, Ganesh