Re: [Haskell-cafe] ANN: unordered-containers - a new, faster hashing-based containers library

A couple thoughts: size takes O(n). That's just depressing. Really.

Do you think union, intersection, etc. could be supported?

Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis

On Sat, Feb 19, 2011 at 11:58 AM, Louis Wasserman wrote:
A couple thoughts: size takes O(n). That's just depressing. Really.

This applies to all the container types. We could support O(1) size at the cost of slowing down e.g. lookup, insert, and delete a little bit. I haven't measured how much yet. Would it be worth it?

Do you think union, intersection, etc. could be supported?

Definitely. I plan to add most of the Data.Map API. We cannot support the ordered operations (such as min), but we should be able to support the rest.

Johan

On Sat, Feb 19, 2011 at 3:04 PM, Johan Tibell wrote:
This applies to all the container types. We could support O(1) size at the cost of slowing down e.g. lookup, insert, and delete a little bit. I haven't measured how much yet. Would it be worth it?

Getting a bit picky, but for the record, Data.Map and Data.Sequence provide O(1) size, and Data.HashTable I believe stores the information but doesn't expose it from its tiny API. That's not an argument either way for what a HashMap should do, however :-)

Cheers,
Sterl.

On Sat, Feb 19, 2011 at 4:27 PM, Sterling Clover wrote:
Getting a bit picky, but for the record, Data.Map and Data.Sequence provide O(1) size, and Data.HashTable I believe stores the information but doesn't expose it from its tiny API. That's not an argument either way for what a HashMap should do, however :-)

We could support O(1) size. I have a design that doesn't waste one word per node to store the size. I'll look into it as soon as I have time.

Johan

On Sat, Feb 19, 2011 at 7:27 PM, Sterling Clover wrote:
Getting a bit picky, but for the record, Data.Map and Data.Sequence provide O(1) size, and Data.HashTable I believe stores the information but doesn't expose it from its tiny API. That's not an argument either way for what a HashMap should do, however :-)
NB: Data.IntMap, which Data.HashMap is based on, actually only provides O(n) size.

-Edward

I'd like to complain about that, too ;)
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
On Sat, Feb 19, 2011 at 9:02 PM, Edward Kmett wrote:
NB: Data.IntMap, which Data.HashMap is based on, actually only provides O(n) size.
-Edward

If you want to use the library and need a short-term fix, just write a small wrapper type/module:

newtype SizedMap = SizedMap (Int, HashMap)

and track the size yourself. The only complication is that on inserts and deletes you'll need to check if the key existed; other than that, it shouldn't be too difficult. This way, the library stays super optimized but, if you need to, you can track the size. As Johan said, it would slow down insert and delete a bit. It shouldn't affect lookup, though.

max

On Feb 20, 2011, at 11:40 AM, Louis Wasserman wrote:
I'd like to complain about that, too ;)
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
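Max's wrapper suggestion can be sketched concretely. A minimal sketch, using Data.Map as a stand-in for HashMap; the names SizedMap, emptySM, sizeSM, insertSM, deleteSM, and lookupSM are all hypothetical:

```haskell
import qualified Data.Map as M  -- stand-in for Data.HashMap

-- Hypothetical wrapper caching the element count alongside the map.
newtype SizedMap k v = SizedMap (Int, M.Map k v)

emptySM :: SizedMap k v
emptySM = SizedMap (0, M.empty)

sizeSM :: SizedMap k v -> Int  -- O(1): just read the cached count
sizeSM (SizedMap (n, _)) = n

-- On insert, bump the count only if the key was absent.
insertSM :: Ord k => k -> v -> SizedMap k v -> SizedMap k v
insertSM k v (SizedMap (n, m))
  | M.member k m = SizedMap (n,     M.insert k v m)
  | otherwise    = SizedMap (n + 1, M.insert k v m)

-- On delete, decrement only if the key was present.
deleteSM :: Ord k => k -> SizedMap k v -> SizedMap k v
deleteSM k (SizedMap (n, m))
  | M.member k m = SizedMap (n - 1, M.delete k m)
  | otherwise    = SizedMap (n, m)

lookupSM :: Ord k => k -> SizedMap k v -> Maybe v  -- unaffected
lookupSM k (SizedMap (_, m)) = M.lookup k m
```

A real version would avoid the double traversal in insertSM/deleteSM (e.g. via M.insertLookupWithKey), which is essentially the size-delta bookkeeping Johan describes later in the thread.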

On Mon, Feb 21, 2011 at 12:58 PM, Max Cantor wrote:
If you want to use the library and need a short term fix, just write a small wrapper type/module
newtype SizedMap = SizedMap (Int, HashMap) and track the size yourself. only complication is that on inserts and deletes you'll need to check if the key existed. other than that, it shouldn't be too difficult.
This way, the library stays super optimized but, if you need, you can track the size. As Johan said, it would slow down insert and delete a bit. shouldn't affect lookup though..
This isn't sufficient in all cases. How would you know the resulting size of a union or intersection?

Cheers,

--
Felipe.

On Mon, Feb 21, 2011 at 8:07 AM, Felipe Almeida Lessa wrote:
This isn't sufficient in all cases. How would you know the resulting size of a union or intersection?
Note that the library does not at present support union or intersection! There are various ways to deal with the problem without caching sizes at nodes, one of which is to count overlap during union or intersection operations (since this involves counting leaves that are visited during these operations).

-Jan-Willem Maessen
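A sketch of the arithmetic Jan-Willem describes, using Data.Map and a separate intersection pass; a real implementation would count the overlapping keys during the merge itself, and sizedUnion is a hypothetical name:

```haskell
import qualified Data.Map as M

-- |union a b| = |a| + |b| - |overlap|, so a union can report its
-- resulting size without an O(n) walk of the result, provided the
-- overlap is counted somewhere along the way.
sizedUnion :: Ord k => M.Map k v -> M.Map k v -> (Int, M.Map k v)
sizedUnion a b = (M.size a + M.size b - overlap, M.union a b)
  where overlap = M.size (M.intersection a b)
```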

On 20 February 2011 03:40, Louis Wasserman wrote:
I'd like to complain about that, too ;)
I'm curious: when do people find that they need a really fast way to get map size? I use them quite a lot, and almost never call length - and when I do, it is typically only to get some statistics about the map for user display - certainly not in an inner loop.

I find it is still useful to have a way to quickly test if a map is empty (for e.g. fixed points that slowly grow a map by unioning in a new bit every time), but "null" works OK for that purpose.

Cheers,
Max

* Max Bolingbroke
I'm curious, when do people find that they need a really fast way to get map size? I use them quite a lot, and almost never call length - and when I do, it is typically only to get some statistics about the map for user display - certainly not in an inner loop.
I find it is still useful to have a way to quickly test if a map is empty (for e.g. fixed points that slowly grow a map by unioning in a new bit every time) but "null" works OK for that purpose.
For example, consider a situation where you have two sets (they can be maps as well), 'a' and 'b', such that by construction you know that 'a' is a subset (submap) of 'b'. Now, size a < size b is a convenient O(1) way to test whether 'a' is a proper subset (submap) of 'b'.

--
Roman I. Cheplyaka :: http://ro-che.info/
Don't worry what people think, they don't do it very often.
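Roman's test can be written down directly with Data.Set, whose size is O(1). The function name and its precondition (that 'a' is already known to be a subset of 'b') are the assumptions here:

```haskell
import qualified Data.Set as S

-- Precondition: a is known, by construction, to be a subset of b.
-- Then a size comparison decides *proper* inclusion in O(1).
isProperGivenSubset :: S.Set a -> S.Set a -> Bool
isProperGivenSubset a b = S.size a < S.size b
```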

On Sat, Feb 19, 2011 at 11:58 AM, Louis Wasserman wrote:
size takes O(n). That's just depressing. Really.

That's rather thoughtless wording for some code that's (a) free, (b) faster than anything else currently available, (c) in its very first release, and (d) available in source form for you to improve as you see fit. Just depressing. Really.

bos:
On Sat, Feb 19, 2011 at 11:58 AM, Louis Wasserman <wasserman.louis@gmail.com> wrote:
size takes O(n). That's just depressing. Really.
That's rather thoughtless wording for some code that's (a) free (b) faster than anything else currently available (c) in its very first release and (d) available in source form for you to improve as you see fit. Just depressing. Really.
Agreed. This is open source: patches or it stays at O(n).

-- Don

Apologies!

I was, in point of fact, working on such a patch, but I think I've been convinced that it's a legitimate position for a package to take. (I'm also curious, Johan, about the approach you figured out that didn't cost a word per node!)

That said, I've been working with somewhat similar structures for my TrieMap package, and I hadn't honestly considered the possibility of having size run in anything but O(1). (When I was comparing benchmarks with bytestring-trie, I didn't realize for a few hours that all of my benchmarks called size, so O(W) operations were being benchmarked at O(n). That threw me off.)
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
On Tue, Feb 22, 2011 at 4:28 PM, Don Stewart wrote:
Agreed. This is open source: patches or it stays at O(n).
-- Don

On Tue, Feb 22, 2011 at 6:28 PM, Louis Wasserman wrote:
Apologies!
Accepted. :)
I was, in point of fact, working on such a patch, but I think I've been convinced that it's a legitimate position for a package to take. (I'm also curious, Johan, about the approach you figured out that didn't cost a word per node!)
I'm working on a patch that provides O(1) size right now. The trick is to define HashMap as:

data HashMap k v = HM {-# UNPACK #-} !Int !(Tree k v)

data Tree k v
    = Nil
    | Tip {-# UNPACK #-} !Hash {-# UNPACK #-} !(FL.FullList k v)
    | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(Tree k v) !(Tree k v)

And have insert, delete, and other operations that change the size return both the updated tree and the size delta (-1, 0, or 1 in the case of insert/delete). The size delta is then applied to the stored size.

At the moment I'm trying to figure out the performance impact of making this change.

Johan
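The delta-returning worker can be illustrated on a toy structure: an unbalanced binary search tree standing in for the real hash trie. Sized, Node, and the helper go are illustrative names, not the patch's actual code:

```haskell
-- Toy stand-in for the real hash trie.
data Tree k v = Nil | Node k v (Tree k v) (Tree k v)

-- The cached size lives only at the root, not in every node.
data Sized k v = Sized !Int !(Tree k v)

empty :: Sized k v
empty = Sized 0 Nil

size :: Sized k v -> Int  -- O(1): read the cached count
size (Sized n _) = n

insert :: Ord k => k -> v -> Sized k v -> Sized k v
insert k v (Sized n t) = case go t of (d, t') -> Sized (n + d) t'
  where
    -- The worker returns the rebuilt tree together with a size
    -- delta: 1 if the key was new, 0 if it overwrote an old entry.
    go Nil = (1, Node k v Nil Nil)
    go (Node k' v' l r) = case compare k k' of
      EQ -> (0, Node k' v l r)
      LT -> case go l of (d, l') -> (d, Node k' v' l' r)
      GT -> case go r of (d, r') -> (d, Node k' v' l r')
```

Only the root applies the delta, so no per-node word is wasted; the real patch additionally returns the pair unboxed.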

Here's the approach I was attempting: a while back there was a proposal to modify Data.Map to support a sort of zipper, including the following API:

data Location k a -- represents a map with a "hole" at a particular key

search :: k -> Map k a -> (Maybe a, Location k a)
key :: Location k a -> k
assign :: a -> Location k a -> Map k a
clear :: Location k a -> Map k a

and from here you might define

insert :: k -> a -> Map k a -> Map k a
insert k a m = case search k m of
  (_, loc) -> assign a loc

The surprising thing was that this approach *increased* the speed of the Map implementation by something like 7%. I don't know what happened to the original proposal, but I was going to try something similar, with something like

data HashMap k v = HM !Int !(Tree k v)

insert k v (HM sz tree) = case search k tree of
  (Nothing, loc) -> HM (sz + 1) (assign v loc)
  (Just _, loc)  -> HM sz (assign v loc)
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
On Tue, Feb 22, 2011 at 8:45 PM, Johan Tibell wrote:
I'm working on a patch that provides O(1) size right now. The trick is to define HashMap as:
data HashMap k v = HM {-# UNPACK #-} !Int !(Tree k v)
data Tree k v = Nil | Tip {-# UNPACK #-} !Hash {-# UNPACK #-} !(FL.FullList k v) | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(Tree k v) !(Tree k v)
And have insert, delete, and other operations that change the size return both the updated tree and the size delta (-1, 0, or 1 in the case of insert/delete). The size delta is then applied to the stored size.
At the moment I'm trying to figure out the performance impact of making this change.
Johan

On Tue, Feb 22, 2011 at 6:53 PM, Louis Wasserman wrote:
Here's the approach I was attempting: a while back there was a proposal to modify Data.Map to support a sort of zipper, including the following API:
data Location k a -- represents a map with a "hole" at a particular key
search :: k -> Map k a -> (Maybe a, Location k a)
key :: Location k a -> k
assign :: a -> Location k a -> Map k a clear :: Location k a -> Map k a
and from here you might define
insert :: k -> a -> Map k a -> Map k a insert k a m = case search k m of (_, loc) -> assign a loc
The surprising thing was that this approach *increased* the speed of the Map implementation by something like 7%.
I'm pretty sure it was a decrease overall. The function under discussion was something like insertWith. The direct implementation of insertWith was the fastest. The zipper approach was 7% faster than using a naive composition of lookup and insert to implement insertWith. The slowdown was most likely caused by the extra O(log n) allocations required to create the zipper (in essence, that means reifying the call stack of lookup).

Johan

Hmmmkay. I'm not entirely convinced, 'cause I've seen genuine (and benchmarked) speedups in applying some of these ideas to my TrieMap library, but I don't have any examples handy.
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
On Tue, Feb 22, 2011 at 9:10 PM, Johan Tibell wrote:
I'm pretty sure it was a decrease overall. The function under discussion was something like insertWith.
The direct implementation of insertWith was the fastest. The zipper approach was 7% faster than using a naive composition of lookup and insert to implement insertWith. The slowdown was most likely caused by the extra O(log n) allocations required to create the zipper (in essence that means reify the call stack of lookup).
Johan

On Tue, Feb 22, 2011 at 6:45 PM, Johan Tibell wrote:
I'm working on a patch that provides O(1) size right now. The trick is to define HashMap as:
data HashMap k v = HM {-# UNPACK #-} !Int !(Tree k v)
data Tree k v = Nil | Tip {-# UNPACK #-} !Hash {-# UNPACK #-} !(FL.FullList k v) | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(Tree k v) !(Tree k v)
And have insert, delete, and other operations that change the size return both the updated tree and the size delta (-1, 0, or 1 in the case of insert/delete). The size delta is then applied to the stored size.
At the moment I'm trying to figure out the performance impact of making this change.
Johan
Initial numbers suggest that lookup gets 3% slower and insert/delete 6% slower. The upside is O(1) size.

Johan

On Tue, Feb 22, 2011 at 9:19 PM, Johan Tibell wrote:
Initial numbers suggest that lookup gets 3% slower and insert/delete 6% slower. The upside is O(1) size.
Can someone come up with a real world example where O(1) size is important?

Johan

On 23 February 2011 05:31, Johan Tibell wrote:
Can someone come up with a real world example where O(1) size is important?
Tangentially - if you changed the API so the size function was called 'count' rather than 'size' or 'length', there would be no shame whatsoever in not being O(1). Problem solved :-)

On 23 February 2011 05:31, Johan Tibell wrote:
Can someone come up with a real world example where O(1) size is important?
I'm a bit sceptical that it is (I was not convinced by the earlier strict-set-inclusion argument, since that's another Data.Map feature I've never used). I thought of some other possibilities, though:

1. If copying an unordered collection to a flat array, you can improve the constant factors (not the asymptotics) with O(1) size to pre-allocate the array.

2. If building a map in a fixed point loop (I personally do this a lot) where you know that the key uniquely determines the element, you can test if a fixed point is reached in O(1) by just comparing the sizes. Depending what you are taking a fixed point of, this may change the asymptotics.

3. Some map combining algorithms work more efficiently when one of their two arguments is smaller. For example, Data.Map.union is most efficient for (bigmap `union` smallmap). If you don't care about which of the two input maps wins when they contain the same keys, you can improve constant factors by testing the size of each input map (in O(1)) and flipping the arguments if you got (smallmap `union` bigmap) instead of the desirable way round.

Personally I don't find any of these *particularly* compelling. But a ~6% slowdown for this functionality is not too bad - have you had a chance to look at the core to see if the cause of the slowdown manifests itself at that level? Perhaps it is possible to tweak the code to make this cheaper.

Also, what was the size of the collections you used in your benchmark? I would expect the relative cost of maintaining the size to get lower as you increased the size of the collection.

Max
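Point 3 might look like this with Data.Map, whose size is O(1). unionBiasedBySize is a hypothetical helper, and it is only appropriate when you don't care which map's values win on duplicate keys:

```haskell
import qualified Data.Map as M

-- Data.Map.union is left-biased and cheapest as (big `union` small).
-- If we don't care about the bias, flip the arguments so that the
-- bigger map is always on the left; both size tests are O(1).
unionBiasedBySize :: Ord k => M.Map k v -> M.Map k v -> M.Map k v
unionBiasedBySize a b
  | M.size a >= M.size b = M.union a b
  | otherwise            = M.union b a
```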

I took a rapid look, and they seem a replacement for pure Maps, but not for mutable HashTables. Sorry if that isn't the case. I don't know if Data.HashTable has improved, but the performance used to be very poor in comparison with other languages.

The point is that pure data structures cannot be used as shared data in multithreaded environments. They must be encapsulated in mutable blocking references, such as MVars, and the whole update process blocks any other thread (it is worse when using TVars). So they are not a replacement for mutable data structures such as Data.HashTable.

For this reason I think that an improvement/mutable replacement of Data.HashTable is needed, if this hasn't been done already. Are there some improvements on it that I don't know about?
2011/2/23 Max Bolingbroke
I'm a bit sceptical that it is (I was not convinced by the earlier strict-set-inclusion argument, since that's another Data.Map feature I've never used). I thought of some other possibilities though: 1. If copying an unordered-collection to a flat array you can improve the constant factors (not the asymptotics) with O(1) size to pre-allocate the array 2. If building a map in a fixed point loop (I personally do this a lot) where you know that the key uniquely determines the element, you can test if a fixed point is reached in O(1) by just comparing the sizes. Depending what you are taking a fixed point of, this may change the asymptotics 3. Some map combining algorithms work more efficiently when one of their two arguments are smaller. For example, Data.Map.union is most efficient for (bigmap `union` smallmap). If you don't care about which of the two input maps wins when they contain the same keys, you can improve constant factors by testing the size of the map input to size (in O(1)) and flipping the arguments if you got (smallmap `union` bigmap) instead of the desirable way round.
Personally I don't find any of these *particularly* compelling. But a ~6% slowdown for this functionality is not too bad - have you had a chance to look at the core to see if the cause of the slowdown manifests itself at that level? Perhaps it is possible to tweak the code to make this cheaper.
Also, what was the size of the collections you used in your benchmark? I would expect the relative cost of maintaining the size to get lower as you increased the size of the collection.
Max

On Wed, Feb 23, 2011 at 12:30 PM, Alberto G. Corona wrote:
For this reason I think that an improvement/mutable replacement of Data.HashTable is needed, if this hasn't been done already. Are there some improvements on it that I don't know about?
I've been working on one lately; some preliminary benchmarks:

https://gist.github.com/826935

It's probably a month or two away from a releasable state, but my work-in-progress is substantially faster (4-6X) than Data.HashTable for inserts and lookups.

G
--
Gregory Collins

On 23 February 2011 12:05, Gregory Collins wrote:
I've been working on one lately, some preliminary benchmarks:
https://gist.github.com/826935
It's probably a month or two away from a releasable state, but my work-in-progress is substantially faster (4-6X) than Data.HashTable for inserts and lookups.
Sounds cool. Can you use your HashTable in the ST monad? I have often found myself wanting a mutable hashtable as a subcomponent in an externally-pure computation.

Max

On Wed, Feb 23, 2011 at 3:49 PM, Max Bolingbroke wrote:
Sounds cool. Can you use your HashTable in the ST monad? I have often found myself wanting a mutable hashtable as a subcomponent in an externally-pure computation.
Yep, all of the functions are currently in ST.
G
--
Gregory Collins
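The externally-pure pattern Max asks about looks like this in general, with Data.STRef standing in for a real ST hash table:

```haskell
import Control.Monad.ST (runST)
import Data.STRef

-- A pure function whose implementation mutates private state in ST;
-- runST guarantees the mutation cannot leak out.
sumSquares :: [Int] -> Int
sumSquares xs = runST $ do
  acc <- newSTRef 0                         -- mutable accumulator
  mapM_ (\x -> modifySTRef' acc (+ x * x)) xs
  readSTRef acc
```

An ST-based hash table slots into the same shape: allocate it inside the runST, mutate freely, and return only an immutable result.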

On Wed, Feb 23, 2011 at 3:30 AM, Alberto G. Corona wrote:
The point is that pure data structures cannot be used as shared data in multithreaded environments. They must be encapsulated in mutable blocking references, such as MVars, and the whole update process blocks any other thread (it is worse when using TVars). So they are not a replacement for mutable data structures such as Data.HashTable.
Depends on your use case. With one writer and multiple readers, an IORef containing a persistent data structure works well. Readers do not block writers, and they can execute concurrently. HashTable is not a concurrent data structure; you need e.g. a lock-free mutable hash table.

Johan
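Johan's single-writer/many-reader setup might be sketched like this. Shared, publish, and readKey are hypothetical names, and Data.Map stands in for the persistent structure:

```haskell
import Data.IORef
import qualified Data.Map as M

-- Writers atomically publish a new immutable map; readers just
-- dereference a snapshot and never block.
type Shared k v = IORef (M.Map k v)

newShared :: IO (Shared k v)
newShared = newIORef M.empty

publish :: Ord k => k -> v -> Shared k v -> IO ()
publish k v ref = atomicModifyIORef' ref (\m -> (M.insert k v m, ()))

readKey :: Ord k => k -> Shared k v -> IO (Maybe v)
readKey k ref = M.lookup k <$> readIORef ref
```

Each reader sees a consistent snapshot of whatever map was last published; the trade-off is that concurrent writers serialize on the single IORef.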

On Wed, Feb 23, 2011 at 11:08 AM, Johan Tibell wrote:
... HashTable is not a concurrent data structure. You need e.g. a lock free mutable hash table.
Good implementations of which are *not* thick on the ground. Even java.util.concurrent isn't fully lock-free.

-Jan

Thanks for the examples. Point 3 is interesting, but most of the gain there could probably be had by telling the user to use (bigmap `union` smallmap). My guess is that the user has a good idea which argument is larger/smaller.
On Wed, Feb 23, 2011 at 12:45 AM, Max Bolingbroke wrote:
Personally I don't find any of these *particularly* compelling. But a ~6% slowdown for this functionality is not too bad - have you had a chance to look at the core to see if the cause of the slowdown manifests itself at that level? Perhaps it is possible to tweak the code to make this cheaper.
I did look at the core for

test :: HashMap Int Int -> HashMap Int Int
test hm = insert 42 12 hm

And I didn't see anything that looked particularly bad. The core uses unboxed values everywhere possible and the recursive function (i.e. inserts) returns an unboxed pair (# Int#, Tree k v #). The code for O(1) size is here:

https://github.com/tibbe/unordered-containers/commit/77bc43acad2eb6b29472cd6...
Also, what was the size of the collections you used in your benchmark? I would expect the relative cost of maintaining the size to get lower as you increased the size of the collection.
I tried with 2^12.

Johan

On 23 February 2011 16:03, Johan Tibell wrote:
Thanks for the examples. Point 3 is interesting but most of the gain there could probably be had by telling the user to use (bigmap `union` smallmap). My guess is that the user has a good idea which argument is larger/smaller.
Totally agreed - this matches my experience with Map/Set.
And I didn't see anything that looked particularly bad. The core uses unboxed values everywhere possible and the recursive function (i.e. inserts) returns an unboxed pair (# Int#, Tree k v #).
Right. I was wondering if you were returning (Int, Tree k v), in which case CPR wouldn't unbox the Int - but I see you already thought of that issue.

By the way, do you plan to add a HashSet to complement HashMap?

Thanks for all the work on the library!

Max

On Wed, Feb 23, 2011 at 12:46 PM, Max Bolingbroke wrote:
By the way, do you plan to add a HashSet to complement HashMap?
Yes, definitely. I want to flesh out the HashMap API some more and then add a HashSet implementation. The code is still "moving around" a lot (e.g. between modules) and I'd like it to settle down a bit before I create HashSet (to avoid having more code to move around).

Johan

On Wed, Feb 23, 2011 at 12:45 AM, Max Bolingbroke wrote:
I'm a bit sceptical that it is (I was not convinced by the earlier strict-set-inclusion argument, since that's another Data.Map feature I've never used). I thought of some other possibilities though: 1. If copying an unordered-collection to a flat array you can improve the constant factors (not the asymptotics) with O(1) size to pre-allocate the array
For kicks, I grepped my local repos for users of Data.Set's size function. My results are imperfect (I have few Github repos, I didn't check .lhs files, I only grepped files for 'Set.size' or 'S.size'), but I found around 100 uses of Data.Set.size.

--
gwern
http://www.gwern.net

On Wed, Feb 23, 2011 at 12:57 PM, Gwern Branwen wrote:
For kicks, I grepped my local repos for users of Data.Set's size function. My results are imperfect (I have few Github repos, I didn't check .lhs files, I only grepped files for 'Set.size' or 'S.size'), but I found around 100 uses of Data.Set.size.
Could you manually look at some of them to see if you find something interesting? In particular, `Set.size s == 0` (a common use of size in imperative languages) could be replaced by `Set.null s`.

Johan

On Wed, Feb 23, 2011 at 1:18 PM, Johan Tibell wrote:
Could you manually look at some of them to see if you find something interesting. In particular `Set.size s == 0` (a common use of size in imperative languages) could be replaced by `Set.null s`.
You could look at them yourself; I attached the files. I see 6 uses out of ~100 which involve an == 0.

--
gwern
http://www.gwern.net

On Wed, Feb 23, 2011 at 1:27 PM, Gwern Branwen wrote:
You could look at them yourself; I attached the files. I see 6 uses out of ~100 which involve an == 0
Looks like the mailing list gateway didn't let your attachments through. No need to attach them, though; I can just grep Hackage for uses.

Johan

A quick grep of some of my own source reveals that I've used M.size and S.size only to test for sizes equal to 1. So, for my purposes at least, an O(1) isSingleton operation would be just as useful as an O(1) size.

Cheers,
Sterl

On Feb 23, 2011, at 4:32 PM, Johan Tibell wrote:
Looks like the mailing list gateway didn't let your attachements through. No need to attach them though, I can just grep Hackage for uses.
Johan

On 2/23/11 4:42 PM, Sterling Clover wrote:
A quick grep of some of my own source reveals that I've used M.size and S.size only to test for sizes equal to 1. So, for my purposes at least, an O(1) isSingleton operation would be just as useful as an O(1) size.
I agree, a fast isSingleton function would cover a very common use case--- especially for set-like containers.

--
Live well,
~wren

Hi,

On Wednesday, 2011-02-23, at 20:06 -0500, wren ng thornton wrote:
I agree, a fast isSingleton function would cover a very common use case--- especially for set-like containers.
Would GHC's rule system be strong enough to replace "size m == 1" by "isSingleton m"? It would be nice if programmers got the advantage even when they did not notice that an isSingleton function is provided.

Greetings,
Joachim

--
Joachim "nomeata" Breitner
mail: mail@joachim-breitner.de | ICQ# 74513189 | GPG-Key: 4743206C
JID: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/
Debian Developer: nomeata@debian.org

On 2/24/11 3:05 AM, Joachim Breitner wrote:
Hi,
would ghc’s rule system be strong enough to replace "size m == 1" by "isSingleton m"? It would be nice if programmers get the advantage even when they did not notice that a isSingleton function is provided.
Not really. I mean, yes, you could use rules for that; but it would be fragile and susceptible to breakage due to refactoring. Since maps are finite, you shouldn't get any changes in the domain semantics (i.e., I believe that whether you hit bottom or not cannot change based on whether rules fire), but you could get asymptotic changes (since size is O(n) times whatever loop you're in) and you can get memory leak changes as well (depending on whether CSE fires or not). For me, allowing that much semantic variability based on whether rules fire or not is unacceptable. This is why I disabled the rewrite rules in Data.List.Extras.LazyLength[1] (which, granted, is even worse since infinite lists mean that you can get domain semantic differences as well).

I'd rather advocate for having smart/lazy size comparison functions like Data.List.LazyLength offers, advertising them well, and then relying on users to say what they mean.

[1] http://hackage.haskell.org/packages/archive/list-extras/0.4.0.1/doc/html/Dat...

--
Live well,
~wren

On 2/23/11 8:06 PM, wren ng thornton wrote:
On 2/23/11 4:42 PM, Sterling Clover wrote:
A quick grep of some of my own source reveals that I've used M.size and S.size only to test for sizes equal to 1. So, for my purposes at least, an O(1) isSingleton operation would be just as useful as an O(1) size.
I agree, a fast isSingleton function would cover a very common use case--- especially for set-like containers.
On reflection, in order to handle all the comparisons of size to some small constant, it would be better to have two functions: one for comparing the size of a collection against a boundary condition, and one for comparing the sizes of two collections. That is, you should add analogues of the functions in Data.List.Extras.LazyLength[1]. The lazy boundary condition function is O(min(m,n)) where m is the size of the boundary and n is the size of the collection. For detecting singletons, doubletons, etc., this reduces to O(1), though there may be a constant factor hit vs. dedicated isSingleton/isDoubleton/etc. functions. It would be a slight difference that diminishes as m increases, but it could be worth benchmarking... Similarly, the lazy comparative size function is O(min(m,n)) where m and n are the sizes of the two collections. This is always less than the O(m)+O(n) of taking the sizes individually[2], but it's remarkably less when one of the collections is known to be much smaller than the other (even if you don't know which is which). [1] http://hackage.haskell.org/packages/archive/list-extras/0.4.0.1/doc/html/Dat... [2] The O(m)+O(n) can be parallelized to yield O(max(m,n)), but that's still greater than O(min(m,n)). With some form of chunked-lazy natural numbers you may be able to get the parallel version to approach O(min(m,n)) and then get into a battle of constant factors. -- Live well, ~wren
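The boundary-condition comparison wren describes can be sketched for Data.Set by walking the lazily produced element list, in the spirit of Data.List.Extras.LazyLength. The names lengthBound and isSingleton below are illustrative, not from any published package:

```haskell
import qualified Data.Set as S

-- Compare the length of a list against a bound n, inspecting at most
-- min(n + 1, length) cells. Returns the same answer as
-- `compare (length xs) n`, but lazily.
lengthBound :: Int -> [a] -> Ordering
lengthBound n []
  | n <= 0    = compare 0 n   -- EQ when n == 0, GT when n < 0
  | otherwise = LT
lengthBound n (_:xs)
  | n <= 0    = GT            -- already longer than the bound
  | otherwise = lengthBound (n - 1) xs

-- O(1) singleton test for a Set: S.toList produces its spine lazily,
-- so only the first two elements are ever demanded.
isSingleton :: S.Set a -> Bool
isSingleton s = lengthBound 1 (S.toList s) == EQ
```

The same shape works for any container with a lazy toList, which is what makes the "compare size to a small constant" use cases cheap even when size itself is O(n).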

On 23 February 2011 21:27, Gwern Branwen
On Wed, Feb 23, 2011 at 1:18 PM, Johan Tibell
wrote: Could you manually look at some of them to see if you find something interesting? In particular `Set.size s == 0` (a common use of size in imperative languages) could be replaced by `Set.null s`.
You could look at them yourself; I attached the files. I see 6 uses out of ~100 which involve an == 0
Thanks for bringing some data to the table. There are definitely some common patterns in what you sent me:

1) For defining Binary instances, you need to write set size before you write the elements: ~7 occurrences
2) Tests against small constants (typically <= 0 or 1, but also 2 and 3!): ~15 occurrences
3) A surprise to me: generating fresh names! People keep a set of all names generated so far, and then just take size+1 as their fresh name. Nice trick. ~17 occurrences
4) Turning sizes into strings for human consumption: ~19 occurrences
5) Just reexporting the functions somehow. Uninformative. ~8 occurrences

There were ~38 occurrences over ~13 repos where it appeared to be somehow fundamental to an algorithm (I didn't look into most of these in detail). I've put those after my message. Frankly I am surprised how much "size" gets used. It seems that making it fast is more important than I thought. Cheers, Max

=== Fundamental-looking occurrences:

bin/folkung/folkung/Clausify.hs: siz (And ps) = S.size ps
bin/folkung/folkung/Paradox/Flatten.hs: | S.size cl >= 1 + S.size bs
bin/folkung/folkung/Paradox/Flatten.hs: || S.size cl >= S.size cl' = largestClique cl gr'
bin/folkung/folkung/Paradox/Flatten.hs: -- && S.size (free xs) <= 1
bin/folkung/folkung/Paradox/Flatten.hs: n = S.size (free ls)
bin/folkung/folkung/Paradox/Flatten.hs: , S.size ws < n-1
bin/folkung/folkung/Paradox/Flatten.hs: (S.size s1,tpsize v2,inter s2) `compare` (S.size s2,tpsize v1,inter s1)
bin/folkung/folkung/Paradox/Flatten.hs: sum [ S.size (s `S.intersection` vs) | (v,vs) <- cons, v `S.member` s ]
bin/folkung/folkung/Paradox/Flatten.hs: , S.size ws' < S.size freeRight-1
bin/folkung/folkung/Paradox/Solve.hs: degree x = S.size . free $ x
bin/gf/src/compiler/GF/Compile/GeneratePMCFG.hs: | product (map Set.size ys) == count
bin/gf/src/compiler/GF/Speech/CFGToFA.hs: indeg (c,_) = maybe 0 Set.size $ Map.lookup c ub
bin/gf/src/compiler/GF/Speech/CFGToFA.hs: where (fa', ns) = newStates (replicate (Set.size cs) ()) fa
bin/gf/src/GF/Compile/GeneratePMCFG.hs: | product (map Set.size ys) == count =
bin/gf/src/GF/Compile/GeneratePMCFGOld.hs: | product (map Set.size ys) == count =
bin/gf/src/GF/Data/MultiMap.hs:size = sum . Prelude.map Set.size . Map.elems
bin/gf/src/GF/Speech/CFGToFA.hs: indeg (c,_) = maybe 0 Set.size $ Map.lookup c ub
bin/gf/src/GF/Speech/CFGToFA.hs: where (fa', ns) = newStates (replicate (Set.size cs) ()) fa
bin/halfs/Halfs/FSRoot.hs: $ Set.size $ fsRootAllInodeNums fsroot
bin/halfs/Halfs.hs: when (Set.size freeSet /= length freeList) (
bin/hcorpus/hcorpus.hs: let rank = 1 `fidiv` Set.size wrds,
bin/hoogle/src/Hoogle/DataBase/TypeSearch/Binding.hs: g (l, vs) = Just $ [restrict|isJust l] ++ replicate (max 0 $ Set.size vs - 1) var
bin/htab/src/Formula.hs:countNominals f = Set.size $ extractNominals f
bin/hylores-2.5/src/HyLoRes/Clause/BasicClause.hs: size = Set.size . toFormulaSet
bin/hylores-2.5/src/HyLoRes/Subsumption/ClausesByFormulaIndex.hs: let sortedCandidates = sortBy (compareUsing Set.size) subsCandidates
bin/hylores-diego/src/HyLoRes/Clause/BasicClause.hs: size = Set.size . toFormulaSet
bin/hylores-diego/src/HyLoRes/Subsumption/ClausesByFormulaIndex.hs: let sortedCandidates = sortBy (compareUsing Set.size) subsCandidates
bin/ipclib/Language/Pepa/Compile/States.hs: Just limit -> ((stateSpaceSize seen) + (Set.size stack)) > limit
bin/jhc/src/Grin/SSimplify.hs: v n | n `IS.member` s = v (1 + n + IS.size s)
bin/jhc/src/Ho/Build.hs: maxModules <- Set.size `fmap` countNodes cn
bin/jhc/src/Ho/Build.hs: maxModules <- Set.size `fmap` countNodes cn
bin/lhc/src/Grin/Optimize/CallPattern.hs: nPatterns = Set.size callPatterns
bin/proteinvis/Graphics/Visualization/Tree/Geometry.hs: where theta = ((fromIntegral index - (fromIntegral (S.size . fst . S.split (Name name) $ missingLeaves))) * 2 * 3.14157) / (num_leaves - fromIntegral (S.size missingLeaves))
bin/proteinvis/ProgramState.hs: c = S.size go
bin/proteinvis/Protein.hs: , term_count = S.size terms
bin/proteinvis/Protein.hs: , term_count = S.size terms
bin/protocol-buffers/hprotoc/Text/ProtocolBuffers/ProtoCompile/Resolve.hs: when (Set.size numbers /= Seq.length (D.DescriptorProto.field dp)) $
bin/protocol-buffers/hprotoc/Text/ProtocolBuffers/ProtoCompile/Resolve.hs: when (Set.size (Set.fromList values) /= Seq.length vs) $
bin/xmobar/Plugins/Mail.hs: changeLoop (mapM (fmap S.size . readTVar) vs) $ \ns -> do

On Wed, Feb 23, 2011 at 1:55 PM, Max Bolingbroke
Thanks for bringing some data to the table. There are definitely some common patterns in what you sent me:
1) For defining Binary instances, you need to write set size before you write the elements: ~7 occurrences
2) Tests against small constants (typically <= 0 or 1, but also 2 and 3!): ~15 occurrences
3) A surprise to me: generating fresh names! People keep a set of all names generated so far, and then just take size+1 as their fresh name. Nice trick. ~17 occurrences
4) Turning sizes into strings for human consumption: ~19 occurrences
5) Just reexporting the functions somehow. Uninformative. ~8 occurrences
There were ~38 occurrences over ~13 repos where it appeared to be somehow fundamental to an algorithm (I didn't look into most of these in detail). I've put those after my message.
Frankly I am surprised how much "size" gets used. It seems that making it fast is more important than I thought.
Nice analysis. Does this apply to maps as well as sets, or are sets used differently than maps somehow? IntMap (which shares its data structure with HashMap) only has O(n) size. I wonder if people avoid using IntMap because of this. I wonder if there's a way to implement size that doesn't mess up the code so badly (see the commit on GitHub to see how badly it messes up e.g. insert). Johan

Attached are all the uses of S.size and Set.size from a semi-recent snapshot of Hackage. Johan

Gwern Branwen wrote:
You could look at them yourself; I attached the files.
Max Bolingbroke wrote:
Frankly I am surprised how much "size" gets used. It seems that making it fast is more important than I thought.
Johan Tibell wrote:
IntMap (which shares data structure with HashMap) only hash O(n) size. I wonder if people avoid using IntMap because of this.
Another common usage for Map is as a functional integer-indexed random access array. Once I implemented the standard algorithm for random shuffle of a list using Data.Map Int. It was much nicer than the STArray version, in my opinion. But when I tried switching to Data.IntMap, hoping to make it faster, I was devastatingly disappointed. Now I understand why. -Yitz
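The shuffle Yitzchak describes can be sketched like this (a hedged reconstruction, not his actual code; nextRand is a toy generator used here only to keep the sketch free of extra packages):

```haskell
import qualified Data.Map as M

-- Toy linear-congruential generator; any real RandomGen would do.
nextRand :: Int -> Int
nextRand s = (s * 1103515245 + 12345) `mod` 2147483648

-- Use Map Int a as a functional random-access array: index the
-- elements, then repeatedly draw a random position and delete it.
-- Each step is O(log n), relying on Data.Map's O(1) size and
-- positional lookup via elemAt.
shuffle :: Int -> [a] -> [a]
shuffle seed xs = go seed (M.fromList (zip [0 :: Int ..] xs))
  where
    go _ m | M.null m = []
    go s m =
      let s'     = nextRand s
          i      = s' `mod` M.size m
          (k, v) = M.elemAt i m
      in  v : go s' (M.delete k m)
```

The inner loop calls size once per draw, which is exactly where IntMap's O(n) size (and its lack of positional indexing) would hurt.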

On 2/23/11 4:55 PM, Max Bolingbroke wrote:
3) A surprise to me: generating fresh names! People keep a set of all names generated so far, and then just take size+1 as their fresh name. Nice trick. ~17 occurrences
This is a really common trick in NLP circles. Of course, it's easy to build a datastructure of your own which caches the size, rather than having the underlying datastructure do it. Speaking of which, I should put my package for that up on Hackage... -- Live well, ~wren
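A size-caching wrapper of the kind wren alludes to is easy to sketch. All names below (SizedSet, insertSS, freshName) are made up for illustration:

```haskell
import qualified Data.Set as S

-- Cache the size alongside the underlying set, so size is O(1)
-- regardless of what the library provides.
data SizedSet a = SizedSet !Int (S.Set a)

emptySS :: SizedSet a
emptySS = SizedSet 0 S.empty

insertSS :: Ord a => a -> SizedSet a -> SizedSet a
insertSS x ss@(SizedSet n s)
  | x `S.member` s = ss
  | otherwise      = SizedSet (n + 1) (S.insert x s)

sizeSS :: SizedSet a -> Int
sizeSS (SizedSet n _) = n

-- The fresh-name trick from Max's survey: if the names handed out so
-- far are exactly 1..n, then n + 1 is always unused.
freshName :: SizedSet Int -> (Int, SizedSet Int)
freshName ss = let n = sizeSS ss + 1 in (n, insertSS n ss)
```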

On Wed, Feb 23, 2011 at 08:45:47AM +0000, Max Bolingbroke wrote:
3. Some map combining algorithms work more efficiently when one of their two arguments is smaller. For example, Data.Map.union is most efficient for (bigmap `union` smallmap). If you don't care about which of the two input maps wins when they contain the same keys, you can improve constant factors by testing the sizes of the two input maps (size is O(1) for Data.Map) and flipping the arguments if you got (smallmap `union` bigmap) instead of the desirable way round.
This is a most unfortunate feature of Data.Map, forcing users to bend the logic of their application to suit the library. Ideally you'd want binary operations that are symmetric and associative performance-wise, freeing users to follow the structure of their problem domain. The documentation (and the original paper) doesn't quantify the gain for such unions, but hopefully it achieves the optimum: O(m log(n/m)), where m and n are the sizes of the smaller and larger trees respectively. Then building a tree costs O(n log(n)), regardless of the way you do it.
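The flipping trick Max describes, which Ross would rather not have to write by hand, looks roughly like this (unionSym is a made-up name; it is only valid when you don't care which map wins on duplicate keys):

```haskell
import qualified Data.Map as M

-- Put the larger map on the left of union. M.size is O(1) for
-- Data.Map, so the test costs nothing. Because union is
-- left-biased, this is only safe when duplicate keys carry equal
-- values, or when you genuinely don't care which value survives.
unionSym :: Ord k => M.Map k v -> M.Map k v -> M.Map k v
unionSym a b
  | M.size a >= M.size b = M.union a b
  | otherwise            = M.union b a
```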

Johan Tibell wrote:
I'm working on a patch that provides O(1) size right now. The trick is to define HashMap as:
data HashMap k v = HM {-# UNPACK #-} !Int !(Tree k v)
Another possibility is:

data HashMap k v = HM Int !(Tree k v)

hashMap t = HM (treeSize t) t

That way size is O(n) on first use but O(1) afterwards. Then again, if someone really needs this they can program it themselves. I've never needed an O(1) size for maps. Roman
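Roman's variant can be spelled out with a toy tree standing in for the real HashMap internals (Tree and treeSize here are illustrative stand-ins, not the actual unordered-containers types):

```haskell
-- A toy stand-in for the real internal structure.
data Tree k v = Leaf | Node (Tree k v) k v (Tree k v)

treeSize :: Tree k v -> Int
treeSize Leaf           = 0
treeSize (Node l _ _ r) = 1 + treeSize l + treeSize r

-- The Int field is deliberately lazy: the O(n) traversal happens on
-- the first call to size and the result is shared afterwards, at the
-- cost of retaining a thunk (and the tree it closes over) until then.
data HashMap k v = HM Int !(Tree k v)

hashMap :: Tree k v -> HashMap k v
hashMap t = HM (treeSize t) t

size :: HashMap k v -> Int
size (HM n _) = n
```

Note the contrast with Johan's strict {-# UNPACK #-} !Int version: that one pays a small cost on every insert/delete to keep the count current, while this one pays nothing until size is first demanded.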

On 22 Feb 2011, at 22:21, Bryan O'Sullivan wrote:
for some code that's (b) faster than anything else currently available
I look forward to seeing some benchmarks against libraries other than containers, such as AVL trees, bytestring-trie, hamtmap, list-trie, etc. Good comparisons of different API-usage patterns are hard to come by. Regards, Malcolm

On Mar 14, 2011 6:23 PM, "Malcolm Wallace"
On 22 Feb 2011, at 22:21, Bryan O'Sullivan wrote:
for some code that's (b) faster than anything else currently available
I look forward to seeing some benchmarks against libraries other than
containers, such as AVL trees, bytestring-trie, hamtmap, list-trie, etc. Good comparisons of different API-usage patterns are hard to come by.

Milan Straka compared containers to a number of other libraries (including some of those you mentioned) and found them all to be slower. Since unordered-containers is faster than (or sometimes as fast as) containers, I haven't really bothered comparing it to libraries other than containers. Johan
participants (21)
- Alberto G. Corona
- Bryan O'Sullivan
- Don Stewart
- Edward Kmett
- Felipe Almeida Lessa
- Gregory Collins
- Gwern Branwen
- Jan-Willem Maessen
- Joachim Breitner
- Johan Tibell
- Louis Wasserman
- Malcolm Wallace
- Max Bolingbroke
- Max Cantor
- Roman Cheplyaka
- Roman Leshchinskiy
- Ross Paterson
- Stephen Tetley
- Sterling Clover
- wren ng thornton
- Yitzchak Gale