Re: [Haskell-cafe] ANN: unordered-containers - a new, faster hashing-based containers library

A couple thoughts: size takes O(n). That's just depressing. Really.

Do you think union, intersection, etc. could be supported?

Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis

On Sat, Feb 19, 2011 at 11:58 AM, Louis Wasserman wrote:
A couple thoughts: size takes O(n). That's just depressing. Really.

This applies to all the container types. We could support O(1) size at the cost of slowing down e.g. lookup, insert, and delete a little bit. I haven't measured how much yet. Would it be worth it?

Do you think union, intersection, etc. could be supported?

Definitely. I plan to add most of the Data.Map API. We cannot support the ordered operations (such as min), but we should be able to support the rest.

Johan

On Sat, Feb 19, 2011 at 3:04 PM, Johan Tibell wrote:
This applies to all the container types. We could support O(1) size at the cost of slowing down e.g. lookup, insert, and delete a little bit. I haven't measured how much yet. Would it be worth it?

Getting a bit picky, but for the record, Data.Map and Data.Sequence provide O(1) size, and Data.HashTable I believe stores the information but doesn't expose it from its tiny API. That's not an argument either way for what a HashMap should do, however :-)

Cheers,
Sterl.

On Sat, Feb 19, 2011 at 4:27 PM, Sterling Clover wrote:
Getting a bit picky, but for the record, Data.Map and Data.Sequence provide O(1) size, and Data.HashTable I believe stores the information but doesn't expose it from its tiny API. That's not an argument either way for what a HashMap should do, however :-)

We could support O(1) size. I have a design that doesn't waste one word per node to store the size. I'll look into it as soon as I have time.

Johan

On Sat, Feb 19, 2011 at 7:27 PM, Sterling Clover wrote:
Getting a bit picky, but for the record, Data.Map and Data.Sequence provide O(1) size, and Data.HashTable I believe stores the information but doesn't expose it from its tiny API. That's not an argument either way for what a HashMap should do, however :-)
NB: Data.IntMap, which Data.HashMap is based on, actually only provides O(n) size.

-Edward

I'd like to complain about that, too ;)
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
On Sat, Feb 19, 2011 at 9:02 PM, Edward Kmett wrote:
NB: Data.IntMap, which Data.HashMap is based on, actually only provides O(n) size.
-Edward

If you want to use the library and need a short-term fix, just write a small wrapper type/module:

newtype SizedMap = SizedMap (Int, HashMap)

and track the size yourself. The only complication is that on inserts and deletes you'll need to check if the key existed; other than that, it shouldn't be too difficult. This way, the library stays super optimized but, if you need to, you can track the size. As Johan said, it would slow down insert and delete a bit. It shouldn't affect lookup, though.

max

On Feb 20, 2011, at 11:40 AM, Louis Wasserman wrote:
I'd like to complain about that, too ;)
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
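Max's wrapper suggestion can be sketched concretely. A minimal sketch, using Data.Map as a stand-in for HashMap; the names SizedMap, emptySM, sizeSM, insertSM, deleteSM, and lookupSM are all hypothetical:

```haskell
import qualified Data.Map as M  -- stand-in for Data.HashMap

-- Hypothetical wrapper caching the element count alongside the map.
newtype SizedMap k v = SizedMap (Int, M.Map k v)

emptySM :: SizedMap k v
emptySM = SizedMap (0, M.empty)

sizeSM :: SizedMap k v -> Int  -- O(1): just read the cached count
sizeSM (SizedMap (n, _)) = n

-- On insert, bump the count only if the key was absent.
insertSM :: Ord k => k -> v -> SizedMap k v -> SizedMap k v
insertSM k v (SizedMap (n, m))
  | M.member k m = SizedMap (n,     M.insert k v m)
  | otherwise    = SizedMap (n + 1, M.insert k v m)

-- On delete, decrement only if the key was present.
deleteSM :: Ord k => k -> SizedMap k v -> SizedMap k v
deleteSM k (SizedMap (n, m))
  | M.member k m = SizedMap (n - 1, M.delete k m)
  | otherwise    = SizedMap (n, m)

lookupSM :: Ord k => k -> SizedMap k v -> Maybe v  -- unaffected
lookupSM k (SizedMap (_, m)) = M.lookup k m
```

A real version would avoid the double traversal in insertSM/deleteSM (e.g. via M.insertLookupWithKey), which is essentially the size-delta bookkeeping Johan describes later in the thread.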

On Mon, Feb 21, 2011 at 12:58 PM, Max Cantor wrote:
If you want to use the library and need a short term fix, just write a small wrapper type/module
newtype SizedMap = SizedMap (Int, HashMap) and track the size yourself. only complication is that on inserts and deletes you'll need to check if the key existed. other than that, it shouldn't be too difficult.
This way, the library stays super optimized but, if you need, you can track the size. As Johan said, it would slow down insert and delete a bit. shouldn't affect lookup though..
This isn't sufficient in all cases. How would you know the resulting size of a union or intersection?

Cheers,

--
Felipe.

On Mon, Feb 21, 2011 at 8:07 AM, Felipe Almeida Lessa wrote:
This isn't sufficient in all cases. How would you know the resulting size of a union or intersection?
Note that the library does not at present support union or intersection! There are various ways to deal with the problem without caching sizes at nodes, one of which is to count overlap during union or intersection operations (since this involves counting leaves that are visited during these operations).

-Jan-Willem Maessen
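A sketch of the arithmetic Jan-Willem describes, using Data.Map and a separate intersection pass; a real implementation would count the overlapping keys during the merge itself, and sizedUnion is a hypothetical name:

```haskell
import qualified Data.Map as M

-- |union a b| = |a| + |b| - |overlap|, so a union can report its
-- resulting size without an O(n) walk of the result, provided the
-- overlap is counted somewhere along the way.
sizedUnion :: Ord k => M.Map k v -> M.Map k v -> (Int, M.Map k v)
sizedUnion a b = (M.size a + M.size b - overlap, M.union a b)
  where overlap = M.size (M.intersection a b)
```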

On 20 February 2011 03:40, Louis Wasserman wrote:
I'd like to complain about that, too ;)
I'm curious: when do people find that they need a really fast way to get map size? I use them quite a lot, and almost never call length - and when I do, it is typically only to get some statistics about the map for user display - certainly not in an inner loop.

I find it is still useful to have a way to quickly test if a map is empty (for e.g. fixed points that slowly grow a map by unioning in a new bit every time), but "null" works OK for that purpose.

Cheers,
Max

* Max Bolingbroke
I'm curious, when do people find that they need a really fast way to get map size? I use them quite a lot, and almost never call length - and when I do, it is typically only to get some statistics about the map for user display - certainly not in an inner loop.
I find it is still useful to have a way to quickly test if a map is empty (for e.g. fixed points that slowly grow a map by unioning in a new bit every time) but "null" works OK for that purpose.
For example, consider a situation where you have two sets (they can be maps as well), 'a' and 'b', such that by construction you know that 'a' is a subset (submap) of 'b'. Now, size a < size b is a convenient O(1) way to test whether 'a' is a proper subset (submap) of 'b'.

--
Roman I. Cheplyaka :: http://ro-che.info/
Don't worry what people think, they don't do it very often.
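Roman's test can be written down directly with Data.Set, whose size is O(1). The function name and its precondition (that 'a' is already known to be a subset of 'b') are the assumptions here:

```haskell
import qualified Data.Set as S

-- Precondition: a is known, by construction, to be a subset of b.
-- Then a size comparison decides *proper* inclusion in O(1).
isProperGivenSubset :: S.Set a -> S.Set a -> Bool
isProperGivenSubset a b = S.size a < S.size b
```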

On Sat, Feb 19, 2011 at 11:58 AM, Louis Wasserman wrote:
size takes O(n). That's just depressing. Really.

That's rather thoughtless wording for some code that's (a) free, (b) faster than anything else currently available, (c) in its very first release, and (d) available in source form for you to improve as you see fit. Just depressing. Really.

bos:
On Sat, Feb 19, 2011 at 11:58 AM, Louis Wasserman <wasserman.louis@gmail.com> wrote:
size takes O(n). That's just depressing. Really.
That's rather thoughtless wording for some code that's (a) free (b) faster than anything else currently available (c) in its very first release and (d) available in source form for you to improve as you see fit. Just depressing. Really.
Agreed. This is open source: patches or it stays at O(n).

-- Don

Apologies!

I was, in point of fact, working on such a patch, but I think I've been convinced that it's a legitimate position for a package to take. (I'm also curious, Johan, about the approach you figured out that didn't cost a word per node!)

That said, I've been working with somewhat similar structures for my TrieMap package, and I hadn't honestly considered the possibility of having size run in anything but O(1). (When I was comparing benchmarks with bytestring-trie, I didn't realize for a few hours that all of my benchmarks called size, so O(W) operations were being benchmarked at O(n). That threw me off.)
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
On Tue, Feb 22, 2011 at 4:28 PM, Don Stewart wrote:
Agreed. This is open source: patches or it stays at O(n).
-- Don

On Tue, Feb 22, 2011 at 6:28 PM, Louis Wasserman wrote:
Apologies!
Accepted. :)
I was, in point of fact, working on such a patch, but I think I've been convinced that it's a legitimate position for a package to take. (I'm also curious, Johan, about the approach you figured out that didn't cost a word per node!)
I'm working on a patch that provides O(1) size right now. The trick is to define HashMap as:

data HashMap k v = HM {-# UNPACK #-} !Int !(Tree k v)

data Tree k v
    = Nil
    | Tip {-# UNPACK #-} !Hash {-# UNPACK #-} !(FL.FullList k v)
    | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(Tree k v) !(Tree k v)

And have insert, delete, and other operations that change the size return both the updated tree and the size delta (-1, 0, or 1 in the case of insert/delete). The size delta is then applied to the stored size.

At the moment I'm trying to figure out the performance impact of making this change.

Johan
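The delta-returning worker can be illustrated on a toy structure: an unbalanced binary search tree standing in for the real hash trie. Sized, Node, and the helper go are illustrative names, not the patch's actual code:

```haskell
-- Toy stand-in for the real hash trie.
data Tree k v = Nil | Node k v (Tree k v) (Tree k v)

-- The cached size lives only at the root, not in every node.
data Sized k v = Sized !Int !(Tree k v)

empty :: Sized k v
empty = Sized 0 Nil

size :: Sized k v -> Int  -- O(1): read the cached count
size (Sized n _) = n

insert :: Ord k => k -> v -> Sized k v -> Sized k v
insert k v (Sized n t) = case go t of (d, t') -> Sized (n + d) t'
  where
    -- The worker returns the rebuilt tree together with a size
    -- delta: 1 if the key was new, 0 if it overwrote an old entry.
    go Nil = (1, Node k v Nil Nil)
    go (Node k' v' l r) = case compare k k' of
      EQ -> (0, Node k' v l r)
      LT -> case go l of (d, l') -> (d, Node k' v' l' r)
      GT -> case go r of (d, r') -> (d, Node k' v' l r')
```

Only the root applies the delta, so no per-node word is wasted; the real patch additionally returns the pair unboxed.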

Here's the approach I was attempting: a while back there was a proposal to modify Data.Map to support a sort of zipper, including the following API:

data Location k a -- represents a map with a "hole" at a particular key

search :: k -> Map k a -> (Maybe a, Location k a)
key :: Location k a -> k
assign :: a -> Location k a -> Map k a
clear :: Location k a -> Map k a

and from here you might define

insert :: k -> a -> Map k a -> Map k a
insert k a m = case search k m of
  (_, loc) -> assign a loc

The surprising thing was that this approach *increased* the speed of the Map implementation by something like 7%. I don't know what happened to the original proposal, but I was going to try something similar, with something like

data HashMap k v = HM !Int !(Tree k v)

insert k v (HM sz tree) = case search k tree of
  (Nothing, loc) -> HM (sz + 1) (assign v loc)
  (Just _, loc)  -> HM sz (assign v loc)
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
On Tue, Feb 22, 2011 at 8:45 PM, Johan Tibell wrote:
I'm working on a patch that provides O(1) size right now. The trick is to define HashMap as:
data HashMap k v = HM {-# UNPACK #-} !Int !(Tree k v)
data Tree k v = Nil | Tip {-# UNPACK #-} !Hash {-# UNPACK #-} !(FL.FullList k v) | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(Tree k v) !(Tree k v)
And have insert, delete, and other operations that change the size return both the updated tree and the size delta (-1, 0, or 1 in the case of insert/delete). The size delta is then applied to the stored size.
At the moment I'm trying to figure out the performance impact of making this change.
Johan

On Tue, Feb 22, 2011 at 6:53 PM, Louis Wasserman wrote:
Here's the approach I was attempting: a while back there was a proposal to modify Data.Map to support a sort of zipper, including the following API:
data Location k a -- represents a map with a "hole" at a particular key
search :: k -> Map k a -> (Maybe a, Location k a)
key :: Location k a -> k
assign :: a -> Location k a -> Map k a clear :: Location k a -> Map k a
and from here you might define
insert :: k -> a -> Map k a -> Map k a insert k a m = case search k m of (_, loc) -> assign a loc
The surprising thing was that this approach *increased* the speed of the Map implementation by something like 7%.
I'm pretty sure it was a decrease overall. The function under discussion was something like insertWith. The direct implementation of insertWith was the fastest. The zipper approach was 7% faster than using a naive composition of lookup and insert to implement insertWith. The slowdown was most likely caused by the extra O(log n) allocations required to create the zipper (in essence, that means reifying the call stack of lookup).

Johan

Hmmmkay. I'm not entirely convinced, 'cause I've seen genuine (and benchmarked) speedups in applying some of these ideas to my TrieMap library, but I don't have any examples handy.
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
On Tue, Feb 22, 2011 at 9:10 PM, Johan Tibell wrote:
I'm pretty sure it was a decrease overall. The function under discussion was something like insertWith.
The direct implementation of insertWith was the fastest. The zipper approach was 7% faster than using a naive composition of lookup and insert to implement insertWith. The slowdown was most likely caused by the extra O(log n) allocations required to create the zipper (in essence that means reify the call stack of lookup).
Johan

On Tue, Feb 22, 2011 at 6:45 PM, Johan Tibell wrote:
I'm working on a patch that provides O(1) size right now. The trick is to define HashMap as:
data HashMap k v = HM {-# UNPACK #-} !Int !(Tree k v)
data Tree k v = Nil | Tip {-# UNPACK #-} !Hash {-# UNPACK #-} !(FL.FullList k v) | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(Tree k v) !(Tree k v)
And have insert, delete, and other operations that change the size return both the updated tree and the size delta (-1, 0, or 1 in the case of insert/delete). The size delta is then applied to the stored size.
At the moment I'm trying to figure out the performance impact of making this change.
Johan
Initial numbers suggest that lookup gets 3% slower and insert/delete 6% slower. The upside is O(1) size.

Johan

On Tue, Feb 22, 2011 at 9:19 PM, Johan Tibell wrote:
Initial numbers suggest that lookup gets 3% slower and insert/delete 6% slower. The upside is O(1) size.
Can someone come up with a real world example where O(1) size is important?

Johan

On 23 February 2011 05:31, Johan Tibell wrote:
Can someone come up with a real world example where O(1) size is important?
Tangentially - if you changed the API so the size function was called 'count' rather than 'size' or 'length', there would be no shame whatsoever in not being O(1). Problem solved :-)

On 23 February 2011 05:31, Johan Tibell wrote:
Can someone come up with a real world example where O(1) size is important?
I'm a bit sceptical that it is (I was not convinced by the earlier strict-set-inclusion argument, since that's another Data.Map feature I've never used). I thought of some other possibilities, though:

1. If copying an unordered collection to a flat array, you can improve the constant factors (not the asymptotics) with O(1) size to pre-allocate the array.

2. If building a map in a fixed point loop (I personally do this a lot) where you know that the key uniquely determines the element, you can test if a fixed point is reached in O(1) by just comparing the sizes. Depending what you are taking a fixed point of, this may change the asymptotics.

3. Some map combining algorithms work more efficiently when one of their two arguments is smaller. For example, Data.Map.union is most efficient for (bigmap `union` smallmap). If you don't care about which of the two input maps wins when they contain the same keys, you can improve constant factors by testing the size of each input map (in O(1)) and flipping the arguments if you got (smallmap `union` bigmap) instead of the desirable way round.

Personally I don't find any of these *particularly* compelling. But a ~6% slowdown for this functionality is not too bad - have you had a chance to look at the core to see if the cause of the slowdown manifests itself at that level? Perhaps it is possible to tweak the code to make this cheaper.

Also, what was the size of the collections you used in your benchmark? I would expect the relative cost of maintaining the size to get lower as you increased the size of the collection.

Max
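Point 3 might look like this with Data.Map, whose size is O(1). unionBiasedBySize is a hypothetical helper, and it is only appropriate when you don't care which map's values win on duplicate keys:

```haskell
import qualified Data.Map as M

-- Data.Map.union is left-biased and cheapest as (big `union` small).
-- If we don't care about the bias, flip the arguments so that the
-- bigger map is always on the left; both size tests are O(1).
unionBiasedBySize :: Ord k => M.Map k v -> M.Map k v -> M.Map k v
unionBiasedBySize a b
  | M.size a >= M.size b = M.union a b
  | otherwise            = M.union b a
```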

I took a rapid look, and they seem a replacement for pure Maps, but not for mutable HashTables. Sorry if that isn't the case. I don't know if Data.HashTable has improved, but the performance used to be very poor in comparison with other languages.

The point is that pure data structures cannot be used as shared data in multithreaded environments. They must be encapsulated in mutable blocking references, such as MVars, and the whole update process blocks any other thread (it is worse when using TVars). So they are not a replacement for mutable data structures such as Data.HashTable.

For this reason I think that an improvement/mutable replacement of Data.HashTable is needed, if this hasn't been done already. Are there some improvements on it that I don't know about?
2011/2/23 Max Bolingbroke
I'm a bit sceptical that it is (I was not convinced by the earlier strict-set-inclusion argument, since that's another Data.Map feature I've never used). I thought of some other possibilities though: 1. If copying an unordered-collection to a flat array you can improve the constant factors (not the asymptotics) with O(1) size to pre-allocate the array 2. If building a map in a fixed point loop (I personally do this a lot) where you know that the key uniquely determines the element, you can test if a fixed point is reached in O(1) by just comparing the sizes. Depending what you are taking a fixed point of, this may change the asymptotics 3. Some map combining algorithms work more efficiently when one of their two arguments are smaller. For example, Data.Map.union is most efficient for (bigmap `union` smallmap). If you don't care about which of the two input maps wins when they contain the same keys, you can improve constant factors by testing the size of the map input to size (in O(1)) and flipping the arguments if you got (smallmap `union` bigmap) instead of the desirable way round.
Personally I don't find any of these *particularly* compelling. But a ~6% slowdown for this functionality is not too bad - have you had a chance to look at the core to see if the cause of the slowdown manifests itself at that level? Perhaps it is possible to tweak the code to make this cheaper.
Also, what was the size of the collections you used in your benchmark? I would expect the relative cost of maintaining the size to get lower as you increased the size of the collection.
Max

On Wed, Feb 23, 2011 at 12:30 PM, Alberto G. Corona wrote:
For this reason I think that an improvement/mutable replacement of Data.HashTable is needed, if this hasn't been done already. Are there some improvements on it that I don't know about?
I've been working on one lately; some preliminary benchmarks:

https://gist.github.com/826935

It's probably a month or two away from a releasable state, but my work-in-progress is substantially faster (4-6X) than Data.HashTable for inserts and lookups.

G
--
Gregory Collins

On 23 February 2011 12:05, Gregory Collins wrote:
I've been working on one lately, some preliminary benchmarks:
https://gist.github.com/826935
It's probably a month or two away from a releasable state, but my work-in-progress is substantially faster (4-6X) than Data.HashTable for inserts and lookups.
Sounds cool. Can you use your HashTable in the ST monad? I have often found myself wanting a mutable hashtable as a subcomponent in an externally-pure computation.

Max

On Wed, Feb 23, 2011 at 3:49 PM, Max Bolingbroke wrote:
Sounds cool. Can you use your HashTable in the ST monad? I have often found myself wanting a mutable hashtable as a subcomponent in an externally-pure computation.
Yep, all of the functions are currently in ST.
G
--
Gregory Collins
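The externally-pure pattern Max asks about looks like this in general, with Data.STRef standing in for a real ST hash table:

```haskell
import Control.Monad.ST (runST)
import Data.STRef

-- A pure function whose implementation mutates private state in ST;
-- runST guarantees the mutation cannot leak out.
sumSquares :: [Int] -> Int
sumSquares xs = runST $ do
  acc <- newSTRef 0                         -- mutable accumulator
  mapM_ (\x -> modifySTRef' acc (+ x * x)) xs
  readSTRef acc
```

An ST-based hash table slots into the same shape: allocate it inside the runST, mutate freely, and return only an immutable result.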

On Wed, Feb 23, 2011 at 3:30 AM, Alberto G. Corona wrote:
The point is that pure data structures cannot be used as shared data in multithreaded environments. They must be encapsulated in mutable blocking references, such as MVars, and the whole update process blocks any other thread (it is worse when using TVars). So they are not a replacement for mutable data structures such as Data.HashTable.
Depends on your use case. With one writer and multiple readers, an IORef containing a persistent data structure works well. Readers do not block writers, and they can execute concurrently. HashTable is not a concurrent data structure; you need e.g. a lock-free mutable hash table.

Johan
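Johan's single-writer/many-reader setup might be sketched like this. Shared, publish, and readKey are hypothetical names, and Data.Map stands in for the persistent structure:

```haskell
import Data.IORef
import qualified Data.Map as M

-- Writers atomically publish a new immutable map; readers just
-- dereference a snapshot and never block.
type Shared k v = IORef (M.Map k v)

newShared :: IO (Shared k v)
newShared = newIORef M.empty

publish :: Ord k => k -> v -> Shared k v -> IO ()
publish k v ref = atomicModifyIORef' ref (\m -> (M.insert k v m, ()))

readKey :: Ord k => k -> Shared k v -> IO (Maybe v)
readKey k ref = M.lookup k <$> readIORef ref
```

Each reader sees a consistent snapshot of whatever map was last published; the trade-off is that concurrent writers serialize on the single IORef.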

On Wed, Feb 23, 2011 at 11:08 AM, Johan Tibell wrote:
... HashTable is not a concurrent data structure. You need e.g. a lock free mutable hash table.
Good implementations of which are *not* thick on the ground. Even java.util.concurrent isn't fully lock-free.

-Jan

Thanks for the examples. Point 3 is interesting, but most of the gain there could probably be had by telling the user to use (bigmap `union` smallmap). My guess is that the user has a good idea which argument is larger/smaller.
On Wed, Feb 23, 2011 at 12:45 AM, Max Bolingbroke wrote:
Personally I don't find any of these *particularly* compelling. But a ~6% slowdown for this functionality is not too bad - have you had a chance to look at the core to see if the cause of the slowdown manifests itself at that level? Perhaps it is possible to tweak the code to make this cheaper.
I did look at the core for

test :: HashMap Int Int -> HashMap Int Int
test hm = insert 42 12 hm

And I didn't see anything that looked particularly bad. The core uses unboxed values everywhere possible and the recursive function (i.e. inserts) returns an unboxed pair (# Int#, Tree k v #). The code for O(1) size is here:

https://github.com/tibbe/unordered-containers/commit/77bc43acad2eb6b29472cd6...
Also, what was the size of the collections you used in your benchmark? I would expect the relative cost of maintaining the size to get lower as you increased the size of the collection.
I tried with 2^12.

Johan

On 23 February 2011 16:03, Johan Tibell wrote:
Thanks for the examples. Point 3 is interesting but most of the gain there could probably be had by telling the user to use (bigmap `union` smallmap). My guess is that the user has a good idea which argument is larger/smaller.
Totally agreed - this matches my experience with Map/Set.
And I didn't see anything that looked particularly bad. The core uses unboxed values everywhere possible and the recursive function (i.e. inserts) returns an unboxed pair (# Int#, Tree k v #).
Right. I was wondering if you were returning (Int, Tree k v), in which case CPR wouldn't unbox the Int - but I see you already thought of that issue.

By the way, do you plan to add a HashSet to complement HashMap?

Thanks for all the work on the library!

Max

On Wed, Feb 23, 2011 at 12:46 PM, Max Bolingbroke wrote:
By the way, do you plan to add a HashSet to complement HashMap?
Yes, definitely. I want to flesh out the HashMap API some more and then add a HashSet implementation. The code is still "moving around" a lot (e.g. between modules) and I'd like it to settle down a bit before I create HashSet (to avoid having more code to move around).

Johan

On Wed, Feb 23, 2011 at 12:45 AM, Max Bolingbroke wrote:
I'm a bit sceptical that it is (I was not convinced by the earlier strict-set-inclusion argument, since that's another Data.Map feature I've never used). I thought of some other possibilities though: 1. If copying an unordered-collection to a flat array you can improve the constant factors (not the asymptotics) with O(1) size to pre-allocate the array
For kicks, I grepped my local repos for users of Data.Set's size function. My results are imperfect (I have few Github repos, I didn't check .lhs files, I only grepped files for 'Set.size' or 'S.size'), but I found around 100 uses of Data.Set.size.

--
gwern
http://www.gwern.net

On Wed, Feb 23, 2011 at 12:57 PM, Gwern Branwen wrote:
For kicks, I grepped my local repos for users of Data.Set's size function. My results are imperfect (I have few Github repos, I didn't check .lhs files, I only grepped files for 'Set.size' or 'S.size'), but I found around 100 uses of Data.Set.size.
Could you manually look at some of them to see if you find something interesting? In particular, `Set.size s == 0` (a common use of size in imperative languages) could be replaced by `Set.null s`.

Johan

On Wed, Feb 23, 2011 at 1:18 PM, Johan Tibell wrote:
Could you manually look at some of them to see if you find something interesting. In particular `Set.size s == 0` (a common use of size in imperative languages) could be replaced by `Set.null s`.
You could look at them yourself; I attached the files. I see 6 uses out of ~100 which involve an == 0.

--
gwern
http://www.gwern.net

On Wed, Feb 23, 2011 at 1:27 PM, Gwern Branwen wrote:
You could look at them yourself; I attached the files. I see 6 uses out of ~100 which involve an == 0
Looks like the mailing list gateway didn't let your attachments through. No need to attach them, though; I can just grep Hackage for uses.

Johan

A quick grep of some of my own source reveals that I've used M.size and S.size only to test for sizes equal to 1. So, for my purposes at least, an O(1) isSingleton operation would be just as useful as an O(1) size.

Cheers,
Sterl

On Feb 23, 2011, at 4:32 PM, Johan Tibell wrote:
Looks like the mailing list gateway didn't let your attachements through. No need to attach them though, I can just grep Hackage for uses.
Johan

On 2/23/11 4:42 PM, Sterling Clover wrote:
A quick grep of some of my own source reveals that I've used M.size and S.size only to test for sizes equal to 1. So, for my purposes at least, an O(1) isSingleton operation would be just as useful as an O(1) size.
I agree, a fast isSingleton function would cover a very common use case--- especially for set-like containers.

--
Live well,
~wren

Hi,

On Wednesday, 2011-02-23, at 20:06 -0500, wren ng thornton wrote:
I agree, a fast isSingleton function would cover a very common use case--- especially for set-like containers.
Would GHC's rule system be strong enough to replace "size m == 1" by "isSingleton m"? It would be nice if programmers got the advantage even when they did not notice that an isSingleton function is provided.

Greetings,
Joachim

--
Joachim "nomeata" Breitner
mail: mail@joachim-breitner.de | ICQ# 74513189 | GPG-Key: 4743206C
JID: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/
Debian Developer: nomeata@debian.org

On 2/24/11 3:05 AM, Joachim Breitner wrote:
Hi,
would ghc’s rule system be strong enough to replace "size m == 1" by "isSingleton m"? It would be nice if programmers get the advantage even when they did not notice that a isSingleton function is provided.
Not really. I mean, yes, you could use rules for that; but it would be fragile and susceptible to breakage due to refactoring. Since maps are finite, you shouldn't get any changes in the domain semantics (i.e., I believe that whether you hit bottom or not cannot change based on whether rules fire), but you could get asymptotic changes (since size is O(n) times whatever loop you're in) and you can get memory leak changes as well (depending on whether CSE fires or not). For me, allowing that much semantic variability based on whether rules fire or not is unacceptable. This is why I disabled the rewrite rules in Data.List.Extras.LazyLength[1] (which, granted, is even worse since infinite lists mean that you can get domain semantic differences as well).

I'd rather advocate for having smart/lazy size comparison functions like Data.List.LazyLength offers, advertising them well, and then relying on users to say what they mean.

[1] http://hackage.haskell.org/packages/archive/list-extras/0.4.0.1/doc/html/Dat...

--
Live well,
~wren

On 2/23/11 8:06 PM, wren ng thornton wrote:
On 2/23/11 4:42 PM, Sterling Clover wrote:
A quick grep of some of my own source reveals that I've used M.size and S.size only to test for sizes equal to 1. So, for my purposes at least, an O(1) isSingleton operation would be just as useful as an O(1) size.
I agree, a fast isSingleton function would cover a very common use case--- especially for set-like containers.
On reflection, in order to handle all the comparisons of size to some small constant, it would be better to have two functions: one for comparing the size of a collection against a boundary condition, and one for comparing the sizes of two collections. That is, you should add analogues of the functions in Data.List.Extras.LazyLength[1]. The lazy boundary condition function is O(min(m,n)) where m is the size of the boundary and n is the size of the collection. For detecting singletons, doubletons, etc., this reduces to O(1), though there may be a constant factor hit vs. dedicated isSingleton/isDoubleton/etc. functions. It would be a slight difference that diminishes as m increases, but it could be worth benchmarking... Similarly, the lazy comparative size function is O(min(m,n)) where m and n are the sizes of the two collections. This is always less than the O(m)+O(n) of taking the sizes individually[2], but it's remarkably less when one of the collections is known to be much smaller than the other (even if you don't know which is which). [1] http://hackage.haskell.org/packages/archive/list-extras/0.4.0.1/doc/html/Dat... [2] The O(m)+O(n) can be parallelized to yield O(max(m,n)), but that's still greater than O(min(m,n)). With some form of chunked-lazy natural numbers you may be able to get the parallel version to approach O(min(m,n)) and then get into a battle of constant factors. -- Live well, ~wren
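The boundary-condition comparison wren describes can be sketched for Data.Set by walking the lazily produced element list, in the spirit of Data.List.Extras.LazyLength. The names lengthBound and isSingleton below are illustrative, not from any published package:

```haskell
import qualified Data.Set as S

-- Compare the length of a list against a bound n, inspecting at most
-- min(n + 1, length) cells. Returns the same answer as
-- `compare (length xs) n`, but lazily.
lengthBound :: Int -> [a] -> Ordering
lengthBound n []
  | n <= 0    = compare 0 n   -- EQ when n == 0, GT when n < 0
  | otherwise = LT
lengthBound n (_:xs)
  | n <= 0    = GT            -- already longer than the bound
  | otherwise = lengthBound (n - 1) xs

-- O(1) singleton test for a Set: S.toList produces its spine lazily,
-- so only the first two elements are ever demanded.
isSingleton :: S.Set a -> Bool
isSingleton s = lengthBound 1 (S.toList s) == EQ
```

The same shape works for any container with a lazy toList, which is what makes the "compare size to a small constant" use cases cheap even when size itself is O(n).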

On 23 February 2011 21:27, Gwern Branwen
On Wed, Feb 23, 2011 at 1:18 PM, Johan Tibell
wrote: Could you manually look at some of them to see if you find something interesting? In particular `Set.size s == 0` (a common use of size in imperative languages) could be replaced by `Set.null s`.
You could look at them yourself; I attached the files. I see 6 uses out of ~100 which involve an == 0
Thanks for bringing some data to the table. There are definitely some common patterns in what you sent me:

1) For defining Binary instances, you need to write set size before you write the elements: ~7 occurrences
2) Tests against small constants (typically <= 0 or 1, but also 2 and 3!): ~15 occurrences
3) A surprise to me: generating fresh names! People keep a set of all names generated so far, and then just take size+1 as their fresh name. Nice trick. ~17 occurrences
4) Turning sizes into strings for human consumption: ~19 occurrences
5) Just reexporting the functions somehow. Uninformative. ~8 occurrences

There were ~38 occurrences over ~13 repos where it appeared to be somehow fundamental to an algorithm (I didn't look into most of these in detail). I've put those after my message. Frankly I am surprised how much "size" gets used. It seems that making it fast is more important than I thought. Cheers, Max

=== Fundamental-looking occurrences:

bin/folkung/folkung/Clausify.hs: siz (And ps) = S.size ps
bin/folkung/folkung/Paradox/Flatten.hs: | S.size cl >= 1 + S.size bs
bin/folkung/folkung/Paradox/Flatten.hs: || S.size cl >= S.size cl' = largestClique cl gr'
bin/folkung/folkung/Paradox/Flatten.hs: -- && S.size (free xs) <= 1
bin/folkung/folkung/Paradox/Flatten.hs: n = S.size (free ls)
bin/folkung/folkung/Paradox/Flatten.hs: , S.size ws < n-1
bin/folkung/folkung/Paradox/Flatten.hs: (S.size s1,tpsize v2,inter s2) `compare` (S.size s2,tpsize v1,inter s1)
bin/folkung/folkung/Paradox/Flatten.hs: sum [ S.size (s `S.intersection` vs) | (v,vs) <- cons, v `S.member` s ]
bin/folkung/folkung/Paradox/Flatten.hs: , S.size ws' < S.size freeRight-1
bin/folkung/folkung/Paradox/Solve.hs: degree x = S.size . free $ x
bin/gf/src/compiler/GF/Compile/GeneratePMCFG.hs: | product (map Set.size ys) == count
bin/gf/src/compiler/GF/Speech/CFGToFA.hs: indeg (c,_) = maybe 0 Set.size $ Map.lookup c ub
bin/gf/src/compiler/GF/Speech/CFGToFA.hs: where (fa', ns) = newStates (replicate (Set.size cs) ()) fa
bin/gf/src/GF/Compile/GeneratePMCFG.hs: | product (map Set.size ys) == count =
bin/gf/src/GF/Compile/GeneratePMCFGOld.hs: | product (map Set.size ys) == count =
bin/gf/src/GF/Data/MultiMap.hs:size = sum . Prelude.map Set.size . Map.elems
bin/gf/src/GF/Speech/CFGToFA.hs: indeg (c,_) = maybe 0 Set.size $ Map.lookup c ub
bin/gf/src/GF/Speech/CFGToFA.hs: where (fa', ns) = newStates (replicate (Set.size cs) ()) fa
bin/halfs/Halfs/FSRoot.hs: $ Set.size $ fsRootAllInodeNums fsroot
bin/halfs/Halfs.hs: when (Set.size freeSet /= length freeList) (
bin/hcorpus/hcorpus.hs: let rank = 1 `fidiv` Set.size wrds,
bin/hoogle/src/Hoogle/DataBase/TypeSearch/Binding.hs: g (l, vs) = Just $ [restrict|isJust l] ++ replicate (max 0 $ Set.size vs - 1) var
bin/htab/src/Formula.hs:countNominals f = Set.size $ extractNominals f
bin/hylores-2.5/src/HyLoRes/Clause/BasicClause.hs: size = Set.size . toFormulaSet
bin/hylores-2.5/src/HyLoRes/Subsumption/ClausesByFormulaIndex.hs: let sortedCandidates = sortBy (compareUsing Set.size) subsCandidates
bin/hylores-diego/src/HyLoRes/Clause/BasicClause.hs: size = Set.size . toFormulaSet
bin/hylores-diego/src/HyLoRes/Subsumption/ClausesByFormulaIndex.hs: let sortedCandidates = sortBy (compareUsing Set.size) subsCandidates
bin/ipclib/Language/Pepa/Compile/States.hs: Just limit -> ((stateSpaceSize seen) + (Set.size stack)) > limit
bin/jhc/src/Grin/SSimplify.hs: v n | n `IS.member` s = v (1 + n + IS.size s)
bin/jhc/src/Ho/Build.hs: maxModules <- Set.size `fmap` countNodes cn
bin/jhc/src/Ho/Build.hs: maxModules <- Set.size `fmap` countNodes cn
bin/lhc/src/Grin/Optimize/CallPattern.hs: nPatterns = Set.size callPatterns
bin/proteinvis/Graphics/Visualization/Tree/Geometry.hs: where theta = ((fromIntegral index - (fromIntegral (S.size . fst . S.split (Name name) $ missingLeaves))) * 2 * 3.14157) / (num_leaves - fromIntegral (S.size missingLeaves))
bin/proteinvis/ProgramState.hs: c = S.size go
bin/proteinvis/Protein.hs: , term_count = S.size terms
bin/proteinvis/Protein.hs: , term_count = S.size terms
bin/protocol-buffers/hprotoc/Text/ProtocolBuffers/ProtoCompile/Resolve.hs: when (Set.size numbers /= Seq.length (D.DescriptorProto.field dp)) $
bin/protocol-buffers/hprotoc/Text/ProtocolBuffers/ProtoCompile/Resolve.hs: when (Set.size (Set.fromList values) /= Seq.length vs) $
bin/xmobar/Plugins/Mail.hs: changeLoop (mapM (fmap S.size . readTVar) vs) $ \ns -> do

On Wed, Feb 23, 2011 at 1:55 PM, Max Bolingbroke
Thanks for bringing some data to the table. There are definitely some common patterns in what you sent me:
1) For defining Binary instances, you need to write set size before you write the elements: ~7 occurrences
2) Tests against small constants (typically <= 0 or 1, but also 2 and 3!): ~15 occurrences
3) A surprise to me: generating fresh names! People keep a set of all names generated so far, and then just take size+1 as their fresh name. Nice trick. ~17 occurrences
4) Turning sizes into strings for human consumption: ~19 occurrences
5) Just reexporting the functions somehow. Uninformative. ~8 occurrences
There were ~38 occurrences over ~13 repos where it appeared to be somehow fundamental to an algorithm (I didn't look into most of these in detail). I've put those after my message.
Frankly I am surprised how much "size" gets used. It seems that making it fast is more important than I thought.
Nice analysis. Does this apply to maps as well as sets, or are sets used differently than maps somehow? IntMap (which shares its data structure with HashMap) only has O(n) size. I wonder if people avoid using IntMap because of this. I wonder if there's a way to implement size that doesn't mess up the code so badly (see the commit on GitHub to see how badly it messes up e.g. insert). Johan

Attached are all the uses of S.size and Set.size from a semi-recent snapshot of Hackage. Johan

Gwern Branwen wrote:
You could look at them yourself; I attached the files.
Max Bolingbroke wrote:
Frankly I am surprised how much "size" gets used. It seems that making it fast is more important than I thought.
Johan Tibell wrote:
IntMap (which shares data structure with HashMap) only hash O(n) size. I wonder if people avoid using IntMap because of this.
Another common usage for Map is as a functional integer-indexed random access array. Once I implemented the standard algorithm for random shuffle of a list using Data.Map Int. It was much nicer than the STArray version, in my opinion. But when I tried switching to Data.IntMap, hoping to make it faster, I was devastatingly disappointed. Now I understand why. -Yitz
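The shuffle Yitzchak describes can be sketched like this (a hedged reconstruction, not his actual code; nextRand is a toy generator used here only to keep the sketch free of extra packages):

```haskell
import qualified Data.Map as M

-- Toy linear-congruential generator; any real RandomGen would do.
nextRand :: Int -> Int
nextRand s = (s * 1103515245 + 12345) `mod` 2147483648

-- Use Map Int a as a functional random-access array: index the
-- elements, then repeatedly draw a random position and delete it.
-- Each step is O(log n), relying on Data.Map's O(1) size and
-- positional lookup via elemAt.
shuffle :: Int -> [a] -> [a]
shuffle seed xs = go seed (M.fromList (zip [0 :: Int ..] xs))
  where
    go _ m | M.null m = []
    go s m =
      let s'     = nextRand s
          i      = s' `mod` M.size m
          (k, v) = M.elemAt i m
      in  v : go s' (M.delete k m)
```

The inner loop calls size once per draw, which is exactly where IntMap's O(n) size (and its lack of positional indexing) would hurt.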

On 2/23/11 4:55 PM, Max Bolingbroke wrote:
3) A surprise to me: generating fresh names! People keep a set of all names generated so far, and then just take size+1 as their fresh name. Nice trick. ~17 occurrences
This is a really common trick in NLP circles. Of course, it's easy to build a datastructure of your own which caches the size, rather than having the underlying datastructure do it. Speaking of which, I should put my package for that up on Hackage... -- Live well, ~wren
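A size-caching wrapper of the kind wren alludes to is easy to sketch. All names below (SizedSet, insertSS, freshName) are made up for illustration:

```haskell
import qualified Data.Set as S

-- Cache the size alongside the underlying set, so size is O(1)
-- regardless of what the library provides.
data SizedSet a = SizedSet !Int (S.Set a)

emptySS :: SizedSet a
emptySS = SizedSet 0 S.empty

insertSS :: Ord a => a -> SizedSet a -> SizedSet a
insertSS x ss@(SizedSet n s)
  | x `S.member` s = ss
  | otherwise      = SizedSet (n + 1) (S.insert x s)

sizeSS :: SizedSet a -> Int
sizeSS (SizedSet n _) = n

-- The fresh-name trick from Max's survey: if the names handed out so
-- far are exactly 1..n, then n + 1 is always unused.
freshName :: SizedSet Int -> (Int, SizedSet Int)
freshName ss = let n = sizeSS ss + 1 in (n, insertSS n ss)
```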

On Wed, Feb 23, 2011 at 08:45:47AM +0000, Max Bolingbroke wrote:
3. Some map combining algorithms work more efficiently when one of their two arguments is smaller. For example, Data.Map.union is most efficient for (bigmap `union` smallmap). If you don't care about which of the two input maps wins when they contain the same keys, you can improve constant factors by testing the sizes of the two input maps (size is O(1) for Data.Map) and flipping the arguments if you got (smallmap `union` bigmap) instead of the desirable way round.
This is a most unfortunate feature of Data.Map, forcing users to bend the logic of their application to suit the library. Ideally you'd want binary operations that are symmetric and associative performance-wise, freeing users to follow the structure of their problem domain. The documentation (and the original paper) doesn't quantify the gain for such unions, but hopefully it achieves the optimum: O(m log(n/m)), where m and n are the sizes of the smaller and larger trees respectively. Then building a tree costs O(n log(n)), regardless of the way you do it.
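The flipping trick Max describes, which Ross would rather not have to write by hand, looks roughly like this (unionSym is a made-up name; it is only valid when you don't care which map wins on duplicate keys):

```haskell
import qualified Data.Map as M

-- Put the larger map on the left of union. M.size is O(1) for
-- Data.Map, so the test costs nothing. Because union is
-- left-biased, this is only safe when duplicate keys carry equal
-- values, or when you genuinely don't care which value survives.
unionSym :: Ord k => M.Map k v -> M.Map k v -> M.Map k v
unionSym a b
  | M.size a >= M.size b = M.union a b
  | otherwise            = M.union b a
```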

Johan Tibell wrote:
I'm working on a patch that provides O(1) size right now. The trick is to define HashMap as:
data HashMap k v = HM {-# UNPACK #-} !Int !(Tree k v)
Another possibility is:

data HashMap k v = HM Int !(Tree k v)

hashMap t = HM (treeSize t) t

That way size is O(n) on first use but O(1) afterwards. Then again, if someone really needs this they can program it themselves. I've never needed an O(1) size for maps. Roman
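Roman's variant can be spelled out with a toy tree standing in for the real HashMap internals (Tree and treeSize here are illustrative stand-ins, not the actual unordered-containers types):

```haskell
-- A toy stand-in for the real internal structure.
data Tree k v = Leaf | Node (Tree k v) k v (Tree k v)

treeSize :: Tree k v -> Int
treeSize Leaf           = 0
treeSize (Node l _ _ r) = 1 + treeSize l + treeSize r

-- The Int field is deliberately lazy: the O(n) traversal happens on
-- the first call to size and the result is shared afterwards, at the
-- cost of retaining a thunk (and the tree it closes over) until then.
data HashMap k v = HM Int !(Tree k v)

hashMap :: Tree k v -> HashMap k v
hashMap t = HM (treeSize t) t

size :: HashMap k v -> Int
size (HM n _) = n
```

Note the contrast with Johan's strict {-# UNPACK #-} !Int version: that one pays a small cost on every insert/delete to keep the count current, while this one pays nothing until size is first demanded.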

On 22 Feb 2011, at 22:21, Bryan O'Sullivan wrote:
for some code that's (b) faster than anything else currently available
I look forward to seeing some benchmarks against libraries other than containers, such as AVL trees, bytestring-trie, hamtmap, list-trie, etc. Good comparisons of different API-usage patterns are hard to come by. Regards, Malcolm

On Mar 14, 2011 6:23 PM, "Malcolm Wallace"
On 22 Feb 2011, at 22:21, Bryan O'Sullivan wrote:
for some code that's (b) faster than anything else currently available
I look forward to seeing some benchmarks against libraries other than
containers, such as AVL trees, bytestring-trie, hamtmap, list-trie, etc. Good comparisons of different API-usage patterns are hard to come by.

Milan Straka compared containers to a number of other libraries (including some of those you mentioned) and found them all to be slower. Since unordered-containers is faster than (or sometimes as fast as) containers, I haven't really bothered comparing it to libraries other than containers. Johan
participants (21)
- Alberto G. Corona
- Bryan O'Sullivan
- Don Stewart
- Edward Kmett
- Felipe Almeida Lessa
- Gregory Collins
- Gwern Branwen
- Jan-Willem Maessen
- Joachim Breitner
- Johan Tibell
- Louis Wasserman
- Malcolm Wallace
- Max Bolingbroke
- Max Cantor
- Roman Cheplyaka
- Roman Leshchinskiy
- Ross Paterson
- Stephen Tetley
- Sterling Clover
- wren ng thornton
- Yitzchak Gale