Re: Proposal: containers: add indexing operations for Set

10 Jul 2012

      Hi,
...
On Tue, Jul 10, 2012 at 1:59 AM, Johan Tibell  wrote:
...
On Thu, Jul 5, 2012 at 8:01 AM, Milan Straka  wrote:
...
Any balanced tree can support indexing API, as long as every subtree
stores its size. That is obviously trivially true for size-balanced
trees, as the subtree sizes are used for balancing. If for example AVL
trees were used, sizes would have to be stored explicitly in every node
for indexing API to work fast.
There might be some other benefits to adding the size as well. IIRC
union is faster if the first set is smaller than the second set. If we
can look up the sizes in O(1) time we can rearrange the arguments to
make sure this is the case.
I've always wondered why Data.Map doesn't do this by default.  E.g. I have:
-- | Map.union says it's more efficient with @big `union` small@, so this one
-- flips the args to be more efficient.  It's still left-biased.
union2 :: (Ord k) => Map.Map k v -> Map.Map k v -> Map.Map k v
union2 m1 m2
    | Map.size m1 < Map.size m2 = m2 `right` m1
    | otherwise = m1 `Map.union` m2
    where right = Map.unionWith (\_ a -> a)
there are several problems with this.

Most notably, it is not true that "set1 < set2" is always better than
"set1 > set2" -- if you run the
containers/benchmarks/SetOperations/bench-SetOperations-Map benchmark,
you can see that for different input distributions, "set1 < set2" is
sometimes faster and sometimes slower than "set1 > set2". For various
input distributions, difference between "set1 < set2" and "set1 > set2"
are -7.5%, -25%, -35%, -17%, +20%, +8%, -13%.

Also union is more efficient than (unionWith const), because is
preserves sharing of the nodes present in both sets. That could be
solved by creating another implementation of right-biased unionR which
also preserves sharing.

Probably the best idea is just to remove the documentation comment about
"bigset `union` smallset" being faster, as it is not true :)

Cheers,
Milan