Re: export toDescList from Data.Map

Date: Tue, 23 Sep 2008 09:09:44 -0700 From: "Evan Laforge"
Subject: Re: export toDescList from Data.Map Actually, Data.Map, Data.IntMap, Data.Set and Data.IntSet do not have an active maintainer and maybe everybody waits for "GSoC thing for tries".
Indeed, as you noted yourself "it would be nice to fix up Map, Set, IntMap, and IntSet". I think it is almost mandatory. So would you do this, too? "toDescList" makes sense for sets, too. On the other hand, I was against toAscList as it is identical to toList. That means I always assumed a bias towards "ascending operations".
I wouldn't mind doing some cleanup...
I use Map, Set, IntMap and IntSet very much, and I'm very interested in seeing to it that these important libraries are maintained and improved. However, short of a major API change, like what's going on with generalized tries and what's already been done with Edison, I don't think there is much work left to be done with these libraries at the API level. Just minor changes like the one you've proposed. There are some things that could be done though... - I'm not sure what Benedikt Huber meant by 'view', but I think he means exposing the tree structure, in a read-only way, to users. I think its a shame that the Map/Set libraries do not expose this. The simplest solution would be to have toTree functions that convert it to a Data.Tree, but I don't think anybody actually likes that data type. A specialized binary tree data type would be more elegant, but there is an issue of how data is stored in the tree. Map/Set store keys and values in internal nodes and use null leaves. IntMap/IntSet store all keys/values in non-null leaves. So maybe this TreeView type would have to be specific to either Map or IntMap. The idea here is that the algorithms and data structures used for these trees are well-known. The papers are linked from the documentation. So the library should expose this to users, in a safe way. - Since someone raised the issue of test suites for performance and correctness, I think it would also be interesting to investigate what effect the strictness annotations in the tree constructors have on performance. Everyone takes for granted that bangs=faster, but I've noticed, as have others, that removing these strictness annotations actually make things run faster. Instead, the tree construction _functions_ should be made strict using seq. The Map construction functions are already strict enough on account of the balancing. Try it for yourself, remove the bangs from the sub-tree fields in the Bin constructor, and run a little benchmark. This would open up the possibility for lazy versions of functions like mapWithKey and mapKeysMonotonic. I had mentioned this previously, but few seemed to care, so I just made the change locally. http://www.haskell.org/pipermail/libraries/2008-August/010371.html . If we we're going to start a little Map/Set/IntMap/IntSet working group, then I'd like to throw that idea back into the mix.
It's also galling that IntMap is hardcoded to 32bit ints and writing a 64 bit version would seem to require yet another copy and paste session. And it seems like theoretically patricia should work on doubles too. But I'm not sure how to address all this code sharing.
IntMap is hard coded to Int, which is 32 bits on a 32 bit architecture, and 64 bits on a 64 bit architecture. I think the main reason for this is so that the key can be unpacked/specialized for extra performance, since that is the primary purpose of the library. If you just need sublinear insert/lookup/delete then use a Map. Are you galled that the key type is not hardcoded to Int64, or that the key is not allowed to be any instance of the Bits class? In the former case, you'd inflate the size of the tree and take a small speed hit on 32bit archs, and in the latter I think you'd have even worse performance. I think the choice of Int is wise because it is the fastest, and that is the point of the library, and if you think about it you're not going to be dealing with more that 2^32 Ints on a 32 bit computer. (What would they be referencing? Not array positions or file offsets.) If you're trying to optimize, say, "Map (Int32,Int32) a" into "IntMap64 a", there are other ways to accomplish that: import qualified Data.IntMap as IM import Data.IntMap (IntMap) newtype IntTrie = IntTrie (IntMap IntTrie) empty :: IntTrie empty = IntTrie IM.empty insert :: [Int] -> IntTrie -> IntTrie insert [] t = t insert (x:xs) (IntTrie t) = IntTrie t' where (y,t') = IM.insertLookupWithKey (const union) x y' t y' = case y of Just y -> insert xs y; Nothing -> insert xs empty delete :: [Int] -> IntTrie -> IntTrie delete [] t = empty delete (x:xs) (IntTrie t) = IntTrie (IM.update f x t) where f y = case IM.delete x (case delete xs y of IntTrie t -> t) of t' | IM.null t' -> Nothing | otherwise -> Just (IntTrie t') And so forth. This is an IntSet for arbitrarily long integers represented as [Int]. An IntMap is probably not much more work.
Maybe this explains the little feedback to this ticket, but I would like to encourge you, to improve things, anyway.
Well thanks. It's nice to hear *something* :)
For my $0.02 your proposal is good. toDescList is very important to have in many situations. I've had it un-hidden locally for a while now. It's silly that it wasn't exported in the first place. Scott

Hi, Scott Dillard schrieb:
- I'm not sure what Benedikt Huber meant by 'view', but I think he means exposing the tree structure, in a read-only way, to users. I think its a shame that the Map/Set libraries do not expose this. The simplest solution would be to have toTree functions that convert it to a Data.Tree, but I don't think anybody actually likes that data type. A specialized binary tree data type would be more elegant, but there is an issue of how data is stored in the tree. Map/Set store keys and values in internal nodes and use null leaves. IntMap/IntSet store all keys/values in non-null leaves. So maybe this TreeView type would have to be specific to either Map or IntMap. The idea here is that the algorithms and data structures used for these trees are well-known. The papers are linked from the documentation. So the library should expose this to users, in a safe way.
I think that having a function
treeView :: Map k v -> T (k,v)
s.t. T is an instance of Foldable, supports efficient left and right folds, and additional operations like subrange queries (all pairs (k,v) s.t. l <= k <= u) would be useful. I'd like to have all functions from Data.Foldable available for folds with key, e.g fold[lr]MWithKey. Supporting toTree (e.g. using a binary tree as you suggested) would be great as well, but I think T (k,v) does not need to be build an intermediate representation for supporting queries/folds, so a newtype should do as well.
- Since someone raised the issue of test suites for performance and correctness, I think it would also be interesting to investigate what effect the strictness annotations in the tree constructors have on performance. ... http://www.haskell.org/pipermail/libraries/2008-August/010371.html
For my $0.02 your proposal is good. toDescList is very important to have in many situations. I've had it un-hidden locally for a while now. It's silly that it wasn't exported in the first place. +1 for toDescList foldrWithKey is certainly useful as well, but I think one should also
Did you also measure the effect of removing strictness annotations on space performance ? think about monadic folds (and the other stuff from Data.Foldable), as explained above. benedikt

On Fri, Sep 26, 2008 at 04:12:17PM +0200, Benedikt Huber wrote:
I think that having a function
treeView :: Map k v -> T (k,v)
s.t. T is an instance of Foldable, supports efficient left and right folds, and additional operations like subrange queries (all pairs (k,v) s.t. l <= k <= u) would be useful. I'd like to have all functions from Data.Foldable available for folds with key, e.g fold[lr]MWithKey.
Map has split and splitLookup for subrange queries, and you could get the folds with mapWithKey (,) :: Map k v -> Map k (k,v)

Ross Paterson schrieb:
On Fri, Sep 26, 2008 at 04:12:17PM +0200, Benedikt Huber wrote:
I think that having a function
treeView :: Map k v -> T (k,v) s.t. T is an instance of Foldable, supports efficient left and right folds, and additional operations like subrange queries (all pairs (k,v) s.t. l <= k <= u) would be useful. I'd like to have all functions from Data.Foldable available for folds with key, e.g fold[lr]MWithKey.
Map has split and splitLookup for subrange queries, and you could get the folds with
mapWithKey (,) :: Map k v -> Map k (k,v)
Thanks for the pointing this out. While browsing Data.Map I noted that the Foldable instance only defines foldMap - it is notably slower to use Data.Map - Foldable.foldr compared to Map.fold. Adding
instance Foldable (Map k) where foldMap = ... foldr f = foldr (const f) foldl f = foldl (\b _ a -> f b a)
really helps. There is also a notable performance penalty using mapWithKey as you suggested, but otherwise it does the job. I still think that toTree or treeView would be interesting, but others might disagree of course. For range queries, split should indeed be sufficient, though a specialized function might be faster. Finally I've attached a (very simple) microbenchmark to demonstrate the need for adding foldr,foldl to the Foldable instance of Data.Map. best regards, benedikt -- Timings {without additional Foldable definitions}/{with foldl,foldr definitions for foldable} -- fold asc-list -- {1.35} testFold_1 = foldr (uncurry f) 0 . M.toAscList -- foldWithKey (foldWithKey) -- {1.45} testFold_2 = M.foldWithKey f 0 -- fold M.mapWithKey (,) -- {9.0 / 3.25} testFold_3 = foldr (uncurry f) 0 . M.mapWithKey (,) -- fold without key (Foldable.foldr) -- {5.06 / 1.35} testFold_4 = foldr (f 0) 0 -- fold (elems.mapWithKey) -- {3.75} testFold_6 = foldr (uncurry f) 0 . M.elems . M.mapWithKey (,) f k v b = b*7+k+2*v main = go testFold_1 0 (M.empty) where go f 5000 = putStrLn "" go f k m = putStr (show (f m) ++ " ") >> go f (succ k) (M.insert k (2*k) m)

- Since someone raised the issue of test suites for performance and correctness, I think it would also be interesting to investigate what effect the strictness annotations in the tree constructors have on performance. Everyone takes for granted that bangs=faster, but I've noticed, as have others, that removing these strictness annotations actually make things run faster. Instead, the tree construction _functions_ should be made strict using seq. The Map construction functions are already strict enough on account of the balancing. Try it for yourself, remove the bangs from the sub-tree fields in the Bin constructor, and run a little benchmark. This would open up the possibility for lazy versions of functions like mapWithKey and mapKeysMonotonic. I had mentioned this previously, but few seemed to care, so I just made the change locally. http://www.haskell.org/pipermail/libraries/2008-August/010371.html . If we we're going to start a little Map/Set/IntMap/IntSet working group, then I'd like to throw that idea back into the mix.
Of course, anything that boosts performance is a win. Does anyone know of any Map benchmarks? I suppose I could write some, but I'd be sort of surprised if someone else hasn't already done that. Then we could throw in all the various avl / redblack / whatever map variants in there and see how they stack up.
IntMap is hard coded to Int, which is 32 bits on a 32 bit architecture, and 64 bits on a 64 bit architecture. I think the main reason for this is so that the key can be unpacked/specialized for extra performance, since that is the primary purpose of the library. If you just need sublinear insert/lookup/delete then use a Map.
To be honest, I had no particular rational reason for wanting IntMap (and in fact I'm not using it). I just heard some mumbling on the list about "Data.Map so slow, IntMap so much faster especially for unions". Though to be honest, probably what most matters to me is how much garbage an insert generates (and similarly how much the resulting structures will share). I didn't run any tests or anything, so I wasn't that upset to not be able to use IntMap :)
Are you galled that the key type is not hardcoded to Int64, or that the key is not allowed to be any instance of the Bits class? In the former case, you'd inflate the size of the tree and take a small speed hit on 32bit archs, and in the latter I think you'd have even worse performance. I think the choice of Int is wise because it is the fastest, and that is the point of the library, and if you think about it you're not going to be dealing with more that 2^32 Ints on a 32 bit computer. (What would they be referencing? Not array positions or file offsets.) If you're trying to
Points in time. I was using Int64 at the time, but I'm actually using Doubles now. Plain Data.Map is working fine in practice.
optimize, say, "Map (Int32,Int32) a" into "IntMap64 a", there are other ways to accomplish that:
Ah yes, splitting the numbers would work for 64 bit words.
participants (4)
-
Benedikt Huber
-
Evan Laforge
-
Ross Paterson
-
Scott Dillard