RFC: Collection Framework, reprise.

Hello,
I'm glad you see so many good points in my design.
I must say however that the operator names are (c) Ross Paterson :)
On 2/21/06, Robert Dockins
Well, I have to begin with the caveat that I am not a data structures expert, and that the design of Edison is almost entirely due to Chris Okasaki. That said, I can certainly share my opinion. In no particular order:

-- Your use of two type parameters to model read-only or unobservable collections is very interesting. I wonder, though, how useful it actually is. It facilitates some interesting possibilities with views, but I can't think of any compelling use cases. I sort of dislike the fact that it can turn certain of the class methods "useless", but maybe the idea just hasn't grown on me yet.
The truth is, I hope this will become nicer when (if) we move to associated types. I need to play a bit more with Phrac to have a clearer idea though.
-- I like the idea of views generally, and I'm now trying to think if there is a good way to work something like them into Edison.

-- The 'isSingleton' method seems unnecessary. Are singleton collections interesting enough to supply a special test for them?
This is based on Christian Maeder's idea. We are planning to move to the AVL trees of A. Hey, which do not provide an O(1) size function. A singleton test thus becomes valuable, and since it costs next to nothing to add to the class I decided to do it.
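For illustration, here is a sketch (not the library's actual code) of why a singleton test stays cheap even when the size of the tree is not cached:

```haskell
-- A search tree without cached sizes, as with the AVL trees
-- mentioned above: 'size' would be O(n) on this representation.
data Tree a = Tip | Node (Tree a) a (Tree a)

-- A singleton test needs to inspect only the root, so it is O(1):
isSingleton :: Tree a -> Bool
isSingleton (Node Tip _ Tip) = True
isSingleton _                = False

-- 'null' is equally cheap:
isEmpty :: Tree a -> Bool
isEmpty Tip = True
isEmpty _   = False
```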
-- Your 'lookup' method has a Monad context, which I assume is the 'NotJustMaybe' pattern, but 'front' and 'back' return Maybe. You should probably pick one style and stick to it.
True.
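For concreteness, here is the 'NotJustMaybe' style sketched over Data.Map. The name lookupM is hypothetical, and this uses today's MonadFail class where the original discussion relied on Monad's fail method:

```haskell
import qualified Data.Map as M

-- The "NotJustMaybe" pattern: let the caller pick the monad in
-- which a lookup failure is reported, instead of hard-wiring Maybe.
lookupM :: (MonadFail m, Ord k) => k -> M.Map k a -> m a
lookupM k m = maybe (fail "lookupM: key not found") return (M.lookup k m)
```

Instantiated at Maybe this behaves like a plain lookup returning Maybe; instantiated at IO, a missing key throws instead.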
-- Qualified infix functions look horrible. I know you say this module should be imported unqualified, but with Prelude clashes people qualify it anyway. I'd suggest you stick to regular prefix functions for your class methods and define the infix functions as the appropriate aliases. That way people importing qualified can still use something reasonable looking.

-- In a related thread I've been discussing argument orders. I think I've just about come to the decision that all the Edison API functions will take the collection argument last, including the troublesome 'rcons'. If you take my suggestion to remove infix symbols from your classes, I suggest you also make your right cons and index functions take the collection argument last. However, I'd probably leave the symbols as is, so that, e.g., (|>) = (flip rcons).
If I understand this correctly, you suggest that regular functions should have a prefix meaning, whereas operators should have infix meaning (also in the light of your answer to Ross). I remember some people spoke against this when we tuned Data.Map (& friends) API some years ago. I guess the reason is that some functions cannot be assigned a clear enough operator (isSubsetOf, etc.)
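A sketch of the convention being discussed, over plain lists for concreteness (the names follow the thread; nothing here is the actual Edison API):

```haskell
-- Prefix class-style functions take the collection argument last...
lcons :: a -> [a] -> [a]
lcons = (:)

rcons :: a -> [a] -> [a]
rcons x xs = xs ++ [x]

-- ...while the operators keep their natural infix orientation,
-- e.g. (|>) = flip rcons, so 'xs |> x' still reads left to right.
(<|) :: a -> [a] -> [a]
(<|) = lcons

(|>) :: [a] -> a -> [a]
(|>) = flip rcons
```

So qualified code can write `Seq.rcons x xs`, while unqualified code can still write `xs |> x`.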
-- In fact, I like your right and left insert symbols so much, I may adopt them myself.

-- You claim that 'union' takes an unspecified element in the case of duplicates, but you also claim it is commutative. You need to specify in what sense you claim commutativity; your collection class doesn't require an Eq or Ord instance, and the claim seems very suspect if you mean up to identity. Similar with intersection.
This has been written in a hurry, and I didn't mean anything very formal. We have re-hashed implied meanings of Eq and Ord quite a few times however, and the general idea is that we require Eq to mean equality, and similarly for Ord, otherwise all bets are off. This is better stated in a general comment rather than in the definition of union, though.
-- Does 'adjust' affect all elements with the given key, or just a single unspecified one?
The class has the (implicit) assumption that a key cannot exist twice. The semantics could be generalized, however.
-- 'Indexed' doesn't have any notion of bounds. That seems like a pretty necessary concept. Humm, I see now that MapLike is a subclass of Indexed. Perhaps then you should have 'DenseIndexed' or 'ArrayLike' which has bounds related methods together with the "denseness" property.
It seems like the class you suggest to add would become a replacement for the Array class.
Overall, I'd say the methods you have carved out seem to be the ones that are most-used, with the notable exception of 'head' and 'tail' for sequences.
Rather, I intend to start with those. It seems easier to add than to remove.
If you are attempting to go for a fairly minimal set of operations (that's more or less what I read your design goals as saying), then the only method I'd consider adding is a strict fold. People get irritated if they can't, e.g., sum the elements in a collection without building a gajillion unnecessary closures.
Every day I pray for a genius to come up with the generic solution to eliminate unnecessary laziness :) Thanks again for the helpful comments. JP.
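The strict fold Robert asks for can be sketched like this, over lists (Data.List.foldl' is the standard version today; the name here is chosen to avoid clashing with it):

```haskell
-- Force the accumulator at every step so that, e.g., summing a
-- collection doesn't build a gajillion unnecessary closures.
foldlStrict :: (b -> a -> b) -> b -> [a] -> b
foldlStrict _ z []     = z
foldlStrict f z (x:xs) = let z' = f z x
                         in z' `seq` foldlStrict f z' xs

sumStrict :: Num a => [a] -> a
sumStrict = foldlStrict (+) 0
```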

On Feb 23, 2006, at 5:33 AM, Jean-Philippe Bernardy wrote:
... On 2/21/06, Robert Dockins
wrote: -- I like the idea of views generally, and I'm now trying to think if there is a good way to work something like them into Edison.

-- The 'isSingleton' method seems unnecessary. Are singleton collections interesting enough to supply a special test for them?
This is based on Christian Maeder's idea. We are planning to move to the AVL trees of A. Hey, which do not provide an O(1) size function. A singleton test thus becomes valuable, and since it costs next to nothing to add to the class I decided to do it.
This leads me to wonder how much of the speed advantage of AVL trees is simply due to the absence of cached tree-size information in each node (not required for the balancing mechanism, but also not available for fast indexing operations). -Jan-Willem Maessen
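The tradeoff Jan-Willem points at can be sketched as two node layouts (hypothetical types, not the library's):

```haskell
-- A node caching its subtree size: O(1) 'size' and fast indexing,
-- at the cost of an extra field per node and bookkeeping on rebalance.
data Sized a = STip | SNode !Int (Sized a) a (Sized a)

ssize :: Sized a -> Int
ssize STip            = 0
ssize (SNode n _ _ _) = n

-- A bare node, as in the AVL library under discussion: smaller
-- nodes and cheaper allocation, but 'size' becomes O(n).
data Plain a = PTip | PNode (Plain a) a (Plain a)

psize :: Plain a -> Int
psize PTip          = 0
psize (PNode l _ r) = psize l + 1 + psize r
```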

On Feb 23, 2006, at 5:33 AM, Jean-Philippe Bernardy wrote:
Hello,
I'm glad you see so many good points in my design.
I must say however that the operator names are (c) Ross Paterson :)
On 2/21/06, Robert Dockins
wrote: Well, I have to begin with the caveat that I am not a data structures expert, and that the design of Edison is almost entirely due to Chris Okasaki. That said, I can certainly share my opinion. In no particular order:

-- Your use of two type parameters to model read-only or unobservable collections is very interesting. I wonder, though, how useful it actually is. It facilitates some interesting possibilities with views, but I can't think of any compelling use cases. I sort of dislike the fact that it can turn certain of the class methods "useless", but maybe the idea just hasn't grown on me yet.
The truth is, I hope this will become nicer when (if) we move to associated types. I need to play a bit more with Phrac to have a clearer idea though.
-- I like the idea of views generally, and I'm now trying to think if there is a good way to work something like them into Edison.

-- The 'isSingleton' method seems unnecessary. Are singleton collections interesting enough to supply a special test for them?
This is based on Christian Maeder's idea. We are planning to move to the AVL trees of A. Hey, which do not provide an O(1) size function. A singleton test thus becomes valuable, and since it costs next to nothing to add to the class I decided to do it.
OK. I'm still not sure why you'd be interested in knowing that a collection is a singleton. Especially given that you can't extract elements with only your Collection class methods. It seems to me that the only time you want a singleton test is when you want to get that thing out. If that's the case, you can just extract one element and test the resulting collection for nullness; but maybe there's something I'm missing.
-- Your 'lookup' method has a Monad context, which I assume is the 'NotJustMaybe' pattern, but 'front' and 'back' return Maybe. You should probably pick one style and stick to it.
True.
-- Qualified infix functions look horrible. I know you say this module should be imported unqualified, but with Prelude clashes people qualify it anyway. I'd suggest you stick to regular prefix functions for your class methods and define the infix functions as the appropriate aliases. That way people importing qualified can still use something reasonable looking.

-- In a related thread I've been discussing argument orders. I think I've just about come to the decision that all the Edison API functions will take the collection argument last, including the troublesome 'rcons'. If you take my suggestion to remove infix symbols from your classes, I suggest you also make your right cons and index functions take the collection argument last. However, I'd probably leave the symbols as is, so that, e.g., (|>) = (flip rcons).
If I understand this correctly, you suggest that regular functions should have a prefix meaning, whereas operators should have infix meaning (also in the light of your answer to Ross).
I'm not sure I understand exactly what you mean by "prefix/infix meaning", but I think so.
I remember some people spoke against this when we tuned Data.Map (& friends) API some years ago. I guess the reason is that some functions cannot be assigned a clear enough operator (isSubsetOf, etc.)
Humm. I can't say I'm fully persuaded by this. As others have mentioned Data.{Map,Set} aren't exactly paragons of elegance, and I'm not inclined to be bound to their design decisions. I also think the landscape may be changing with Haskell implementations drifting toward full support for Unicode source files, which would provide lovely operators for all sorts of collection operations. In the meantime, I think we can just provide ASCII infix operators for troublesome or particularly common operations. Does anyone know any other reasons this approach was rejected? I suspect that qualified imports also played a role in this decision.
-- In fact, I like your right and left insert symbols so much, I may adopt them myself.

-- You claim that 'union' takes an unspecified element in the case of duplicates, but you also claim it is commutative. You need to specify in what sense you claim commutativity; your collection class doesn't require an Eq or Ord instance, and the claim seems very suspect if you mean up to identity. Similar with intersection.
This has been written in a hurry, and I didn't mean anything very formal. We have re-hashed implied meanings of Eq and Ord quite a few times however, and the general idea is that we require Eq to mean equality, and similarly for Ord, otherwise all bets are off.
I guess my point is that your classes don't require an Eq or Ord context, so it's hard to know what defines the equivalence class under consideration.
This is better stated in a general comment rather than in the definition of union, though.
-- Does 'adjust' affect all elements with the given key, or just a single unspecified one?
The class has the (implicit) assumption that a key cannot exist twice. The semantics could be generalized, however.
-- 'Indexed' doesn't have any notion of bounds. That seems like a pretty necessary concept. Humm, I see now that MapLike is a subclass of Indexed. Perhaps then you should have 'DenseIndexed' or 'ArrayLike' which has bounds related methods together with the "denseness" property.
It seems like the class you suggest to add would become a replacement for the Array class.
Well, the only reason I mention it is because you say in the comments that indexes are "dense", and that lookup cannot return "not found". That sounds particularly array-like, except for the lack of bounds. As "MapLike" is a subclass of "Indexed", you should probably remove those comments. If you want a class with those properties it should not be a superclass of "MapLike" and probably needs bounds.
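Data.Array already embodies the combination Robert describes, so the "dense plus bounds" behaviour can be sketched against it (the helper name is made up):

```haskell
import Data.Array

-- In a dense, bounded structure every index inside 'bounds' is
-- present, so a lookup can only fail by being out of range.
denseLookup :: Ix i => Array i e -> i -> Maybe e
denseLookup a i
  | inRange (bounds a) i = Just (a ! i)
  | otherwise            = Nothing
```

A hypothetical 'ArrayLike' class would essentially carry these two operations: a bounds method and a total lookup for in-range indices.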
Overall, I'd say the methods you have carved out seem to be the ones that are most-used, with the notable exception of 'head' and 'tail' for sequences.
Rather, I intend to start with those. It seems easier to add than to remove.
Indeed.
If you are attempting to go for a fairly minimal set of operations (that's more or less what I read your design goals as saying), then the only method I'd consider adding is a strict fold. People get irritated if they can't, e.g., sum the elements in a collection without building a gajillion unnecessary closures.
Every day I pray for a genius to come up with the generic solution to eliminate unnecessary laziness :)
I just bet that complete and sound strictness analysis is undecidable (does anybody know for sure?); I don't think we'll ever fully get away from having to deal with manual strictness annotations once we've committed to non-strict semantics. <aside> If there's one thing I learned from reading _Purely Functional Data Structures_, it is that strictness and laziness both have their places. I'm glad to see bang patterns getting some serious consideration for Haskell'. </aside>
Thanks again for the helpful comments. JP.
Rob Dockins Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG

Jean-Philippe Bernardy wrote:
-- The 'isSingleton' method seems unnecessary. Are singleton collections interesting enough to supply a special test for them?
This is based on Christian Maeder's idea. We are planning to move to the AVL trees of A. Hey, which do not provide an O(1) size function. A singleton test thus becomes valuable, and since it costs next to nothing to add to the class I decided to do it.
Here's a more general suggestion: move the method 'front' up into class 'Collection'. It is always meaningful in some way: it would get the leftmost item from a sequence, the least key from a set, the least priority from a priority queue and so on. Sure, the naming is not the most fortunate. What would be appropriate for a method that extracts a somewhat unspecified element? 'split', 'extract', 'view', 'decompose', 'item' or something else? Alternatively, 'front' could be left where it is, extracting the leftmost element. A similar method, which extracts an unspecified element, should be added to 'Collection'. Once you have 'front' you can implement 'isSingleton' and 'null' outside the class. The same is true for 'fold' and a strict 'foldl' which is currently missing. Udo. -- "Being abstract is something profoundly different from being vague." -- E. W. Dijkstra
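Udo's point, sketched over plain lists for concreteness ('extract' is one of his candidate names; this is not library code):

```haskell
-- Split off one (unspecified) element, if any. For a sequence this
-- would be the leftmost item; for a set, e.g. the least element.
extract :: [a] -> Maybe (a, [a])
extract []     = Nothing
extract (x:xs) = Just (x, xs)

-- With 'extract' in the Collection class, these fall out as plain
-- functions defined outside the class:
isEmpty :: [a] -> Bool
isEmpty = maybe True (const False) . extract

isSingleton :: [a] -> Bool
isSingleton = maybe False (isEmpty . snd) . extract
```

A fold (and a strict fold) can be derived the same way, by repeatedly extracting until Nothing.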

Sometimes you want to add an element, or update an existing element if one exists. You may, for example, want to store a list of values at each node in a Map, so if a key doesn't exist you add a singleton list, if the key does exist you add the new value to the list with that key. The value type is thus of a *different* type than the values that are inserted.

  insertWith' :: (b -> a -> a) -> (b -> a) -> k -> a -> c -> c

Where the second function is used only in the "not existing" case. So the case described above could be done like so:

  myInsert = insertWith' (:) (:[])

That is, if the key exists, the value is simply added to the front of the list; if the key doesn't exist, it's added to an empty list. This is more general than insertWith, so maybe insertWith could be defined outside the class using this... insertWith' is probably not the best name...

Or... you could generalise adjust instead:

  adjust :: (Maybe v -> v) -> k -> c -> c

  myInsert k v = adjust f k
    where f Nothing   = [v]
          f (Just vs) = v : vs

I think I may like this better... It moves the "insertion type is different from stored type"-thing outside the type signatures... /S -- Sebastian Sylvan +46(0)736-818655 UIN: 44640862

On 2/23/06, Sebastian Sylvan
Sometimes you want to add an element, or update an existing element if one exists.
You may, for example, want to store a list of values at each node in a Map, so if a key doesn't exist you add a singleton list, if the key does exist you add the new value to the list with that key. The value type is thus of a *different* type than the values that are inserted.
insertWith' :: (b -> a -> a) -> (b -> a) -> k -> a -> c -> c
Sorry, that should be:

  insertWith' :: (b -> a -> a) -> (b -> a) -> k -> b -> c -> c

/S -- Sebastian Sylvan +46(0)736-818655 UIN: 44640862
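With the corrected signature, both of Sebastian's variants can be sketched over Data.Map using M.alter (which postdates this thread; the function names here are the thread's proposals, not an established API):

```haskell
import qualified Data.Map as M

-- insertWith': the second function handles the "key not present"
-- case, so the inserted type b may differ from the stored type a.
insertWith' :: Ord k => (b -> a -> a) -> (b -> a) -> k -> b -> M.Map k a -> M.Map k a
insertWith' f g k x = M.alter step k
  where step Nothing  = Just (g x)
        step (Just v) = Just (f x v)

-- The generalised 'adjust' expresses the same idea:
adjustG :: Ord k => (Maybe a -> a) -> k -> M.Map k a -> M.Map k a
adjustG f = M.alter (Just . f)

-- The running example: accumulate values into a list per key.
myInsert :: Ord k => k -> b -> M.Map k [b] -> M.Map k [b]
myInsert = insertWith' (:) (:[])
```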

On Thu, Feb 23, 2006 at 11:33:27AM +0100, Jean-Philippe Bernardy wrote:
I must say however that the operator names are (c) Ross Paterson :)
The operator symbols are just ASCII versions of symbols that have been in use for some time, e.g. by Chris Okasaki.
We are planning to move to the AVL trees of A. Hey, which do not provide an O(1) size function.
What will we gain from this move? And what will we lose, apart from O(1) size?

On 2/27/06, Ross Paterson
We are planning to move to the AVL trees of A. Hey, which do not provide an O(1) size function.
What will we gain from this move? And what will we lose, apart from O(1) size?
The gains:

1. Christian Maeder and others report a significant performance improvement. This is the main motivating factor.
2. The AVL tree library is very comprehensive and can be very useful by itself.
3. A separate balanced Tree infrastructure provides many benefits:
   * Shared code between Map and Set.
   * Conversion between those should then become more efficient (requested by John Meacham).
   * Possibility to base more data structures on AVL trees (eg. StringMap).
   * It should be easier to fine-tune algorithms.

We will provide toTree/unsafeFromSortedTree that will allow users to write code that has access to the tree structure. An alternative is to keep the same tree implementation, and separate the Tree data type from the high-level Set/Map abstract type. This is yet more work however, and ignores the fact that the trees in Set and Map are actually different types.

The losses:

* More code to maintain. AVL trees are also trickier.
* Transitional costs: changes in behaviour may affect user code. (Despite many efforts to be as compatible as possible.)

Cheers, JP.

Jean-Philippe Bernardy wrote:
On 2/27/06, Ross Paterson
wrote: We are planning to move to the AVL trees of A. Hey, which do not provide an O(1) size function. What will we gain from this move? And what will we lose, apart from O(1) size?
The gains: 1. Christian Maeder and others report a significant performance improvement. This is the main motivating factor.
Indeed, the gain is significant (at least 5% lookup speedup!). Adrian Hey simply proved to me that AVL trees (with his clever constructor choice without an explicit height) are a better data structure than Adams's "Efficient sets: a balancing act". And it's sort of regrettable that the old FiniteMap and the current Data.Map are based on the latter (until someone finds a speedup for Adams's sets)!
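The "clever constructor choice" can be sketched like this: the balance factor lives in the constructor tag, so no height (or size) field is stored in any node. This mirrors the shape of Adrian Hey's AVL library but is written here as an illustration, not copied from it:

```haskell
-- Balance factor encoded in the constructor: no height field.
data AVL a = E                    -- empty
           | N (AVL a) a (AVL a)  -- right subtree is taller by one
           | Z (AVL a) a (AVL a)  -- subtrees have equal height
           | P (AVL a) a (AVL a)  -- left subtree is taller by one

-- The height is still recoverable by walking one spine:
height :: AVL a -> Int
height E         = 0
height (N l _ _) = 2 + height l  -- right taller: height = hl + 2
height (Z l _ _) = 1 + height l
height (P _ _ r) = 2 + height r  -- left taller: height = hr + 2
```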
2. The AVL tree library is very comprehensive and can be very useful by itself.
I don't share this point. The trees suffer the severe problem that the order argument can easily be mixed up. So I'd prefer not to directly use these trees. I also did not like the hscpp stuff and the many (tricky?) modules when installing it.
3. A separate balanced Tree infrastructure provides many benefits: * Shared code between Map and Set. * Conversion between those should then become more efficient (requested by John Meacham)
Without sharing the current Data.Map is faster and even the currently fastest "AVL (k, a)" would become faster. However, I would not sacrifice sharing for an only marginal speedup (below 1%). (But it would be interesting to know the speedup.)

Further personal judgements:

1. Data.IntMap and Data.IntSet don't need to be touched since they are fast.
2. (Portable) Data.Map and Data.Set modules belong in the base package. These data structures would be useful for ghc itself.
3. Jean-Philippe's collection framework may sit on top in a different package (supporting more generic applications).
4. A few interface cleanups should be no problem.

Cheers Christian

On Wednesday 01 Mar 2006 2:13 pm, Christian Maeder wrote:
Jean-Philippe Bernardy wrote:
On 2/27/06, Ross Paterson
wrote: We are planning to move to the AVL trees of A. Hey, which do not provide an O(1) size function.
What will we gain from this move? And what will we lose, apart from O(1) size?
The gains: 1. Christian Maeder and others report a significant performance improvement. This is the main motivating factor.
Indeed, the gain is significant (at least 5% lookup speedup!). Adrian Hey simply proved to me that AVL trees (with his clever constructor choice without an explicit height) are a better data structure than Adams's "Efficient sets: a balancing act". And it's sort of regrettable that the old FiniteMap and the current Data.Map are based on the latter (until someone finds a speedup for Adams's sets)!
Well 5% is not so wonderful IMO :-) But I guess for Maps it ain't too bad either, considering the indirection handicap suffered by AVL (until we get type specialisation :-) What were your keys, btw?

I think the real problems with the Adams trees are..

1 - They don't do a very good job of balancing in pathological situations (sorted inserts), at least as currently implemented. Though I believe the algorithm does have a tweakable parameter, so maybe it could do better if it was tweaked.

2 - The Hedge algorithm appears to suck badly, if the number of comparisons needed to do union etc. is the benchmark. Even with cheap comparisons (Ints), AVL+divide&conquer is much faster (for Sets at least).

Regards -- Adrian Hey

Adrian Hey wrote:
What were your keys, btw?
data ShATerm = ShAAppl String [Int] [Int]
             | ShAList [Int] [Int]
             | ShAInt Integer [Int]
  deriving (Eq, Ord)

(where the last list is always empty)

and something (with short strings) like:

data Id = Id [String] [Id]

and I used Int keys a lot for the comparison (where Data.IntMap is always faster). Cheers Christian

On Wednesday 01 Mar 2006 4:01 pm, Christian Maeder wrote:
Adrian Hey wrote:
What were your keys, btw?
data ShATerm = ShAAppl String [Int] [Int]
| ShAList [Int] [Int] | ShAInt Integer [Int]
deriving (Eq, Ord)
(where the last list is always empty)
and something (with short strings) like:
data Id = Id [String] [Id]
I think this sort of thing is good illustration of why I don't think the indirection overhead of implementing Maps as AVL trees of association pairs will matter much in practice. Chances are that it will only be used in circumstances where indirection cost is trivial compared to comparison costs. There's also a small saving in heap burn rate too (1 less field per tree node). It also illustrates why I'm now quite sceptical about the use of any balanced tree implementation of sets and maps. It should only really be used as a default first implementation of JPB's API which will work for any instance of Ord, but it will never be very efficient for non-trivial Key types IMO. I think what we should be aiming for in the long term is automatically derived Generalised Tries (as discussed recently). Regards -- Adrian Hey
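For list keys, the trie that generalised-trie derivation would produce looks roughly like this hand-written sketch (compare the Data.StringMap link above; all names here are made up):

```haskell
import qualified Data.Map as M

-- A trie keyed by lists: an optional value at the current node,
-- plus a map from first elements to sub-tries for the tails.
data ListMap k v = ListMap (Maybe v) (M.Map k (ListMap k v))

emptyLM :: ListMap k v
emptyLM = ListMap Nothing M.empty

lookupLM :: Ord k => [k] -> ListMap k v -> Maybe v
lookupLM []     (ListMap v _) = v
lookupLM (k:ks) (ListMap _ m) = M.lookup k m >>= lookupLM ks

insertLM :: Ord k => [k] -> v -> ListMap k v -> ListMap k v
insertLM []     v (ListMap _ m)  = ListMap (Just v) m
insertLM (k:ks) v (ListMap mv m) = ListMap mv (M.insert k sub m)
  where sub = insertLM ks v (M.findWithDefault emptyLM k m)
```

An optimised representation would want to collapse the long chains of singleton maps such tries tend to contain.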

On Thu, Mar 02, 2006 at 09:13:11PM +0000, Adrian Hey wrote:
I think what we should be aiming for in the long term is automatically derived Generalised Tries (as discussed recently).
Would it be possible to mock up an implementation of this using Data.Generics by chance? so we can use a generalized trie for anything in Data and Ord? I find the idea of a derived generalized trie quite intriguing. would associated types be the best way to express such a thing (so you can use 'Trie a' everywhere as a type rather than some complicated class predicate). <noise> once we get derived generalized tries, I want to try derived generalized huffman (arithmetic?) encoding, derived generalized LZ*, and a derived generalized efficient 'diff' algorithm... </noise> John -- John Meacham - ⑆repetae.net⑆john⑈

John Meacham writes:
On Thu, Mar 02, 2006 at 09:13:11PM +0000, Adrian Hey wrote:
I think what we should be aiming for in the long term is automatically derived Generalised Tries (as discussed recently).
Would it be possible to mock up an implementation of this using Data.Generics by chance? so we can use a generalized tree for anything in Data and Ord? I find the idea of a derived generalized trie quite intriguing. would associated types be the best way to express such a thing (so you can use 'Trie a' everywhere as a type rather than some complicated class predicate).
A while back, I put together an implementation of generic tries based on
Ralf Hinze's "Generics for the Masses". This particular technique allows
you to write M functions for N types with M+N instances, instead of M*N.
It's now on-line at http://www.eyrie.org/~zednenem/2006/gentrie/. It's
a darcs repository, so "darcs get" will work. There are two versions
(corresponding to associated type synonyms and associated data types),
plus an alternate interface that hides the map's type.
--
David Menendez

Hello John,

Friday, March 3, 2006, 1:10:17 AM, you wrote:

JM> Would it be possible to mock up an implementation of this using
JM> Data.Generics by chance? so we can use a generalized trie for anything

Are you aware that Data.Generics has run-time overheads and can therefore be several times slower than a fixed algorithm, or an algorithm generated at compile time (for example, by TH)?

JM> <noise>
JM> once we get derived generalized tries, I want to try derived
JM> generalized huffman (arithmetic?) encoding, derived generalized LZ*, and
JM> a derived generalized efficient 'diff' algorithm...
JM> </noise>

and this will be 1000 times slower than C implementations

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Fri, Mar 03, 2006 at 11:34:58AM +0300, Bulat Ziganshin wrote:
Friday, March 3, 2006, 1:10:17 AM, you wrote:
JM> Would it be possible to mock up an implementation of this using JM> Data.Generics by chance? so we can use a generalized tree for anything
Are you aware that Data.Generics has run-time overheads and can therefore be several times slower than a fixed algorithm, or an algorithm generated at compile time (for example, by TH)?
JM> <noise> JM> once get get derived generalized tries, I want to try derived JM> generalized huffman (arithmetic?) encoding, derived generalized LZ*, and JM> a derived generalized efficient 'diff' algorithm... JM> </noise>
and this will be 1000 times slower than C implementations
Oh yes. I am completely aware of that. I just find the algorithms fascinating. DrIFT would be the obvious real way to go for haskell (or TH if you are into that sort of thing) or perhaps c-generated haskell... -- John Meacham - ⑆repetae.net⑆john⑈

On Thursday 02 Mar 2006 10:10 pm, John Meacham wrote:
On Thu, Mar 02, 2006 at 09:13:11PM +0000, Adrian Hey wrote:
I think what we should be aiming for in the long term is automatically derived Generalised Tries (as discussed recently).
Would it be possible to mock up an implementation of this using Data.Generics by chance? so we can use a generalized tree for anything in Data and Ord? I find the idea of a derived generalized trie quite intriguing. would associated types be the best way to express such a thing (so you can use 'Trie a' everywhere as a type rather than some complicated class predicate).
I'm afraid I'm not sufficiently well versed in bleedin' edge type theory and type system extensions to comment. But I would say that my preference would be for some compile time solution, rather than Data.Generics (because of the runtime cost which I suspect this may involve). This exercise is all about performance after all, so I don't want to throw any away unnecessarily. I think your earlier idea of using DrIFT seemed good, though I'm not sure what you can do with it or whether it needs modifying in any way. (For generalised tries we're deriving new types and then deriving new class instances of the newly derived types.)

Since we last discussed this I've been thinking about it and have a few points I'd like to mention if you'd like to give this a shot..

Regarding efficiency, I think that in these nested Maps of Maps of Maps.. it's likely that the overwhelming majority of these Maps are singletons and also that you will get long singleton chains. Also for the Map tuples used for algebraic data type constructors, it's likely that most of the Maps in the tuple are empty. So there's probably some optimised representation you could use to exploit these facts. A good test would be to see if the derivation produced a Trie implementation for Lists that was as efficient as a handmade ListMap using Tries, maybe similar to.. http://homepages.nildram.co.uk/~ahey/HLibs/Data.StringMap/

Regarding the role of the Ord class in all this, I think we should still insist that Key types are instances of Ord (and that the Trie implementation is consistent with that instance) for several reasons..

1 - Compatibility with balanced tree implementations.
2 - To give meaningful ordering for functions like elemsAscending etc..
3 - To enable a standard solution to the "Set of Sets" problem that doesn't require deriving yet another generalised Trie type for the original generalised Trie type, if you see what I mean (more about this later).

For derived instances of Ord this probably isn't too hard. It seems somewhat more difficult for hand written instances (e.g. one that used a different ordering from the default for performance reasons).

Regarding the "Set of Sets" problem, if the element type is an instance of Ord then we can uniquely represent a Set as a sorted list (obtained via elemsAscending or similar). So we could just use a general Trie based ListMap(Set) implementation for these. All that's needed to implement this is an efficient Map implementation for the element type (which has already been derived).

This all seems doable in practice, though no doubt there's some gotcha which I've overlooked. I have no plans to do this in the near future but it would be really good if someone else wanted to give it a go (someone such as yourself for example :-). Regards -- Adrian Hey
participants (11)

- Adrian Hey
- Bulat Ziganshin
- Christian Maeder
- David Menendez
- Jan-Willem Maessen
- Jean-Philippe Bernardy
- John Meacham
- Robert Dockins
- Ross Paterson
- Sebastian Sylvan
- Udo Stenzel