FAO Ketil Malde [Fwd: Re: Data.Map, Data.IntMap documentation]

Hello Folks,
Sorry to post this here but my ISP seems to have been blacklisted by
University of Bergen SpamAssassin or something and I didn't want to
appear to ignore Ketil.
-------- Original Message --------
Subject: Re: Data.Map, Data.IntMap documentation
Date: Mon, 20 Aug 2007 09:05:56 +0100
From: Adrian Hey
On Thu, 2007-08-16 at 11:52 +0100, Adrian Hey wrote:
http://darcs.haskell.org/packages/collections-ghc6.6/Data/Map/AVL.hs
http://darcs.haskell.org/packages/collections-ghc6.6/Data.Trie.General/Data/...
I have some Map-intensive code which may be a useful benchmark for these (i.e. most of the time is spent building and then doing lookups in a large Map - 100s of Mb, typically).
Is AVL compatible enough that I would only need to change the imports to test this?
It should be, unless you're using indexing or something. Otherwise you might get some deprecation warnings (but those functions that are deprecated are still implemented). Also, the Monoid instance is different. Also, size is O(n), not O(1). (But the only reason this is O(1) with Data.Map is that you pay extra for this information on every operation that created the Map in the first place).
What kind of performance would you expect compared to the regular Map?
It depends what you're doing. There are some aspects of the current Data.Map that will cause sucky performance in some circumstances, but if you're not falling foul of those then I would expect a modest but not particularly dramatic performance increase. Particular problems with Data.Map seem to be...

1 - Poor balancing can mean it takes over twice as long to grow a tree by insertion in "pathological" cases (the keys are in fact sorted, but you don't know in advance that they are sorted).

2 - For union, intersection etc., the Hedge algorithm seems to require something like 4 to 5 times as many comparisons as divide and conquer for large maps. Of course, what effect this actually has depends on the key type and the cost of comparison.

The clone also gives you much better strictness control, which may have some influence on performance, space behaviour etc., for better or worse (but only if you modify your code to use it, of course :-).

The only possible adverse factor with the clone is that there's an extra level of indirection overhead, in that the underlying implementation is an AVL tree of boxed (key,value) pairs. Again, how significant this is depends on the key type and comparison costs. For lookup it may be significant for simple keys with cheap comparison. There's also some advantage in a reduced heap burn rate during insertion etc. (smaller tree records), and of course better balancing also helps keep paths shorter and the heap burn rate down.

Anyway, it'd be good to get some feedback on what effect it has on your program, or even whether or not it works :-). It hasn't been given any really thorough tests as such (but then again, neither has Data.Map, if bug reports are anything to go by). But it's a fairly simple wrapper around AVL (which has been thoroughly tested), so hopefully there's not too much wrong with it. If you try it you should check out the darcs repository (the hackage version is out of date and nothing like the version in the repository). Regards -- Adrian Hey
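The "only change the imports" idea from earlier in the thread would look something like this sketch. `Data.Map.AVL` is the module name from the repository linked above, assumed here to export the same interface as Data.Map; the benchmark body is invented for illustration:

```haskell
import qualified Data.Map as Map          -- the standard implementation
-- import qualified Data.Map.AVL as Map   -- the AVL clone: swapping this
--                                        -- import in should be the only
--                                        -- change needed (barring indexing,
--                                        -- the Monoid instance, and O(n) size)

main :: IO ()
main = do
  -- build a large map, then look things up in it, as in Ketil's benchmark
  let m = Map.fromList [(k, 2 * k) | k <- [1 .. 100000 :: Int]]
  print (Map.lookup 50000 m)   -- lookup in the large map
  print (Map.size m)           -- O(1) in Data.Map, O(n) in the AVL clone
```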

Adrian Hey wrote:
Is AVL compatible enough that I would only need to change the imports to test this?
Also, size is O(n), not O(1). (But the only reason this is O(1) with Data.Map is that you pay extra for this information on every operation that created the Map in the first place).
Since it's a fairly well balanced tree, I would think there could be an "approximate size" in O(log(n)) time? Of course this would be another exported function... would making it produce equal results for all equal maps be difficult? Also, there are "small length"-like functions that tell you the size of something up to a point, or report "more than that"; this can probably be done with lazy "elems"? (e.g. minMapSize of two maps, at O(min(m,n))) Isaac

Isaac Dupree wrote:
Since it's a fairly well balanced tree, I would think there could be an "approximate size" in O(log(n)) time? Of course this would be another exported function... would making it produce equal results for all equal maps be difficult?
I did consider something like this. It's certainly true that you could derive definite upper and lower bounds on the size in O(log n) time. Trouble is, the result of this would depend on details of tree structure which AVL and Data.Map make non-observable by defining equality in terms of (association) lists. So if a function like this was exposed you could have two "equal" trees (Maps) that gave different answers. I should mention that there's no particular technical reason why an AVL based Data.Map implementation could not provide O(1) size and O(log n) indexing. I just didn't think it was worth the extra cost and effort on my part. It would still cost O(n) in extra heap storage, and indexing a Map is a bit suspect IMO. The only reasonable use of indexing is covered by the clone's OMap type or the AVL zipper, I think, and I couldn't think of any good reason why anybody would want the size of a Map unless they were about to do something that was O(n) anyway (or worse).
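For illustration, the O(log n) size bounds mentioned above fall out of the AVL height invariant: a tree of height h has at most 2^h - 1 nodes and at least F(h+2) - 1, where F is Fibonacci (the height itself is obtainable in O(log n) by walking the taller spine when balance factors are stored). A sketch, with `sizeBounds` an invented name:

```haskell
-- Size bounds for an AVL tree from its height h (counting a single
-- node as height 1): F(h+2) - 1 <= n <= 2^h - 1.
-- Both bounds depend on tree shape, which is exactly why exposing
-- them would make "equal" Maps observably different.
sizeBounds :: Int -> (Integer, Integer)
sizeBounds h = (fibs !! (h + 2) - 1, 2 ^ h - 1)
  where fibs = 0 : 1 : zipWith (+) fibs (tail fibs)  -- F(0)=0, F(1)=1

main :: IO ()
main = print (sizeBounds 3)  -- a height-3 AVL tree has between 4 and 7 nodes
```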
Also, there are "small length"-like functions that tell you the size of something up to a point, or report "more than that"; this can probably be done with lazy "elems"? (e.g. minMapSize of two maps, at O(min(m,n)))
Yes, something like this would be possible and easy. I'll think about adding it. Regards -- Adrian Hey
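The "size up to a bound" idea can be sketched against Data.Map's lazy `elems`; the names `sizeUpTo` and `firstIsSmaller` are invented for illustration, and a real interface might well differ:

```haskell
import qualified Data.Map as Map

-- Exact size if it is <= cap, Nothing for "more than that".
-- Costs O(min(cap, n)) list cells, because elems is lazy.
sizeUpTo :: Int -> Map.Map k a -> Maybe Int
sizeUpTo cap = go 0 . Map.elems
  where
    go n _ | n > cap = Nothing
    go n []          = Just n
    go n (_ : xs)    = (go $! n + 1) xs

-- The O(min(m,n)) two-map comparison falls out the same way:
-- walk the two lazy elems lists in step, stop at the shorter one.
firstIsSmaller :: Map.Map k a -> Map.Map j b -> Bool
firstIsSmaller m1 m2 = go (Map.elems m1) (Map.elems m2)
  where
    go []      (_ : _) = True
    go _       []      = False
    go (_ : x) (_ : y) = go x y
```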

Adrian Hey wrote:
Isaac Dupree wrote:
Since it's a fairly well balanced tree, I would think there could be an "approximate size" in O(log(n)) time? Of course this would be another exported function... would making it produce equal results for all equal maps be difficult?
I did consider something like this. It's certainly true that you could derive definite upper and lower bounds on size in O(log n) time.
Trouble is, the result of this would depend on details of tree structure which AVL and Data.Map make non-observable by defining equality in terms of (association) lists. So if a function like this was exposed you could have two "equal" trees (Maps) that gave different answers.
Yes, assuming we can't come up with some clever way to avoid that (and it's not seeming possible to me at the moment)
I should mention that there's no particular technical reason why an AVL based Data.Map implementation could not provide O(1) size and O(log n) indexing. I just didn't think it was worth the extra cost and effort on my part.
It would still cost O(n) in extra heap storage, and indexing a Map is a bit suspect IMO. The only reasonable use of indexing is covered by the clone's OMap type or the AVL zipper, I think, and I couldn't think of any good reason why anybody would want the size of a Map unless they were about to do something that was O(n) anyway (or worse).
Maybe a use would be "if it's small, do something O(n^2)" or, more generally, choosing an algorithm based on size (maybe random sampling if it's _really_ big...). I don't suppose there's an easy way to offer caching of information like that to those who have a use for it AT THE SAME TIME AS efficiency, for those who don't... I don't see a use for _indexing_ either. Reminds me a bit of C++ Boost's multi_container (I think) that could be a map and a sequence and another kind of map all at the same time, sort of...
Also, there are "small length"-like functions that tell you the size of something up to a point, or report "more than that"; this can probably be done with lazy "elems"? (e.g. minMapSize of two maps, at O(min(m,n)))
Yes, something like this would be possible and easy. I'll think about adding it.
Consider that you need something isomorphic to lazy Peano numerals in order to compare any two arbitrary possibly-very-large structures' sizes... so I don't know what a really good interface for that is. Isaac

On Mon, Aug 20, 2007 at 03:43:43PM -0300, Isaac Dupree wrote:
Adrian Hey wrote:
I should mention that there's no particular technical reason why an AVL based Data.Map implementation could not provide O(1) size and O(log n) indexing. I just didn't think it was worth the extra cost and effort on my part. It would still cost O(n) in extra heap storage, and indexing a Map is a bit suspect IMO. The only reasonable use of indexing is covered by the clone's OMap type or the AVL zipper, I think, and I couldn't think of any good reason why anybody would want the size of a Map unless they were about to do something that was O(n) anyway (or worse).
Maybe a use would be "if it's small, do something O(n^2)" or, more generally, choosing an algorithm based on size (maybe random sampling if it's _really_ big...). I don't suppose there's an easy way to offer caching of information like that to those who have a use for it AT THE SAME TIME AS efficiency, for those who don't...
Sounds like an unsafeTreeHeight would be better...
I don't see a use for _indexing_ either. Reminds me a bit of C++ Boost's multi_container (I think) that could be a map and a sequence and another kind of map all at the same time, sort of...
Paterson's Data.FingerTree already covers just about every uncommon case imaginable with asymptotically very good performance, so there's not much need for indexing operations in Map. (And if people demand a fast structure with both indexing and lookup, we can always specialize again.) The one time I had occasion to use size, I was searching a map using log^2(n) binary search with a monotonic function of the values; could something like a lazy O(n) assocsSet be added? (Then I could use mapMonotonic and not need indexing at all.) Stefan
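The log^2(n) search described here can be sketched with Data.Map's existing indexing primitives, `size` and `elemAt`. `searchMonotone` is an invented name, and the predicate `p` must be monotone over the values in key order (all-False then all-True):

```haskell
import qualified Data.Map as Map

-- Binary search by index: find the smallest key whose value satisfies
-- the monotone predicate p. Each elemAt probe is O(log n) and there are
-- O(log n) probes, giving O(log^2 n) overall.
searchMonotone :: (v -> Bool) -> Map.Map k v -> Maybe k
searchMonotone p m = go 0 (Map.size m)
  where
    -- invariant: the answer index, if any, lies in [lo, hi)
    go lo hi
      | lo >= hi  = if lo < Map.size m
                      then Just (fst (Map.elemAt lo m))
                      else Nothing     -- no value satisfies p
      | p v       = go lo mid          -- answer is at mid or earlier
      | otherwise = go (mid + 1) hi    -- answer is strictly after mid
      where
        mid    = (lo + hi) `div` 2
        (_, v) = Map.elemAt mid m
```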

On Mon, Aug 20, 2007 at 08:28:03AM -0300, Isaac Dupree wrote:
Also, there are "small length"-like functions that tell you the size of something up to a point, or report "more than that"; this can probably be done with lazy "elems"? (e.g. minMapSize of two maps, at O(min(m,n)))
Thanks to type classes and laziness, implementing that is as easy as providing a genericLength function, with lazy naturals taking the remainder of the burden; see http://www.haskell.org/pipermail/haskell-cafe/2007-July/028503.html for more details. (Yes, this will be slower than a direct smallerThan operation, but if it's an infrequently used operation the simplicity is probably worth it.) Stefan
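The lazy-naturals idea, in sketch form. The names `Nat`, `lazyLength` and `smallerThan` are assumptions for illustration; the linked cafe post has the full treatment:

```haskell
import qualified Data.Map as Map

-- A lazy Peano natural: comparison only forces as many Succ
-- constructors as the smaller number has, so comparing the lengths
-- of two lists costs O(min(m, n)).
data Nat = Zero | Succ Nat deriving (Eq)

instance Ord Nat where
  compare Zero     Zero     = EQ
  compare Zero     _        = LT
  compare _        Zero     = GT
  compare (Succ m) (Succ n) = compare m n

-- genericLength into Nat: builds the numeral lazily from the list spine.
lazyLength :: [a] -> Nat
lazyLength = foldr (\_ n -> Succ n) Zero

-- "Is this map smaller than that one?" without computing either size.
smallerThan :: Map.Map k a -> Map.Map j b -> Bool
smallerThan m1 m2 = lazyLength (Map.elems m1) < lazyLength (Map.elems m2)
```

Because the comparison stops at the shorter numeral, this even terminates when one of the lists is infinite, e.g. `lazyLength [1,2,3] < lazyLength [1..]`.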
participants (3)
- Adrian Hey
- Isaac Dupree
- Stefan O'Rear