Fwd: signficant improvements to the containers package

Some people might be quite excited by Milan's work on significant
performance improvements to the containers package...
----- Forwarded message from Milan Straka
Data.Map is the way it is for historical reasons. It needs a dedicated performance maintainer to do detailed benchmarking and static analysis.
I did quite a lot of benchmarking, strictness analysis and performance improvements to the containers package recently, see the draft of my paper http://fox.ucw.cz/papers/containers/containers.pdf. Next week I will start finalizing the patches and submit them. The 'not generating a worker/wrapper' issue is one of several on my radar, so this should be resolved. Cheers, Milan ----- End forwarded message -----

On 24 June 2010 16:57, Don Stewart
Some people might be quite excited by Milan's work on significant performance improvements to the containers package...
This looks quite nice; good work Milan!!! As an aside, Alex Mason and I are discussing the possibility of taking advantage of AusHack *shameless plug* to write some kind of classes for the different types of containers with a hierarchy. I know about ListLike, but there doesn't seem to be any applicable classes for generic containers (i.e. the abstract API of a Set; something like ListLike would then be an extension on it) and for lookup data structures (Map k a, [(k, a)], etc.). -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On 24 June 2010 08:08, Ivan Miljenovic
As an aside, Alex Mason and I are discussing the possibility of taking advantage of AusHack *shameless plug* to write some kind of classes for the different types of containers with a hierarchy. I know about ListLike, but there doesn't seem to be any applicable classes for generic containers (i.e. the abstract API of a Set; something like ListLike would then be an extension on it) and for lookup data structures (Map k a, [(k, a)], etc.).
There are some classes for bulk types in Simon Peyton-Jones's paper "Bulk Types with Class". Personally, I'll take a lot of convincing that a class approach will be better than using the module system...

On 24 June 2010 17:15, Stephen Tetley
There are some classes for bulk types in Simon Peyton-Jones's paper "Bulk Types with Class".
Cool, I'll have a look.
Personally, I'll take a lot of convincing that a class approach will be better than using the module system...
My rational for a class approach is that rather than having your library spit out a list of values, etc. you let the consumer pick which data type they prefer (if they're going to be just converting your list into a Set, then why not give them a Set to start with?). -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On 24 June 2010 08:20, Ivan Miljenovic
My rational for a class approach is that rather than having your library spit out a list of values, etc. you let the consumer pick which data type they prefer (if they're going to be just converting your list into a Set, then why not give them a Set to start with?).
That's a bit more interesting, at least to my tastes. In my own work I often use specialised container types: e.g. rather than use a 'safe list' library that avoids certain list ops or wraps them as Maybe's, I prefer using a OneList: data OneList a = One a | Cons a (OneList a) With structures like a OneList their construction is more 'primary' than their manipulation. That's to say clearly when you build a OneList it can't be empty (and you obviously don't want it to be), but as a structure its not conducive to many manipulations, e.g: you can't filter it, as filtering can produce an empty list but the data type can't. Some "algebra" of constructors and destructors would be useful here, to get the filterInto function for example: filterInto :: Consable c => (a -> Bool) -> OneList a -> c a Consable could be the Coll class from "Bulk Types with Class" or probably better, a smaller one with just empty and cons, or even just cons inheriting Monoid. Unfortunately inheriting Monoid puts "legitimacy" on (++) which we know is bad on regular lists, and should discouraged where feasible. Similarly size in the Coll class is bad on regular lists... it goes on until will have single operation type classes for all the collection operations.

On 24 June 2010 17:55, Stephen Tetley
On 24 June 2010 08:20, Ivan Miljenovic
wrote: My rational for a class approach is that rather than having your library spit out a list of values, etc. you let the consumer pick which data type they prefer (if they're going to be just converting your list into a Set, then why not give them a Set to start with?).
That's a bit more interesting, at least to my tastes.
In my own work I often use specialised container types: e.g. rather than use a 'safe list' library that avoids certain list ops or wraps them as Maybe's, I prefer using a OneList:
data OneList a = One a | Cons a (OneList a)
Hmmm..... /me makes a note to see whether or not having an empty value is a requirement of such classes...
With structures like a OneList their construction is more 'primary' than their manipulation. That's to say clearly when you build a OneList it can't be empty (and you obviously don't want it to be), but as a structure its not conducive to many manipulations, e.g: you can't filter it, as filtering can produce an empty list but the data type can't.
Yeah, which can be a problem.
Some "algebra" of constructors and destructors would be useful here, to get the filterInto function for example:
filterInto :: Consable c => (a -> Bool) -> OneList a -> c a
Consable could be the Coll class from "Bulk Types with Class" or probably better, a smaller one with just empty and cons, or even just cons inheriting Monoid. Unfortunately inheriting Monoid puts "legitimacy" on (++) which we know is bad on regular lists, and should discouraged where feasible. Similarly size in the Coll class is bad on regular lists... it goes on until will have single operation type classes for all the collection operations.
If you inherit Monoid, then Bytestring can't be an instance of the class... This dichotomy is a problem: we'd like to have as many possible types be instances of such classes as possible; however, if we do that then we can't use pre-existing classes such as Functor since Set isn't an instance of Functor, etc. Anyway, since I've brought the whole thing up now rather than in the blog post I was planning to write this weekend, what are people's thoughts on preferred extension used: * MPTCs + fundeps Pros: older tech, more widely understood/used, ListLike already uses it Cons: I don't like the fact that the element type becomes a "first class member" of the type class; I'd prefer writing "SetLike s => Value s -> s -> s" to "SetLike s v => v -> s -> s" as IMHO the class represents the data type/structure and not the value it stores (that's a sub-part). * Type Families (actually using an Associated Type) Pros: IMHO cleaner type sigs (see above), in "Fun with Type Families", the authors call Type Families more like functional programming whereas fundeps are more like logic programming, and also state that type families play nicer with GADTs, etc. Cons: newer tech, so still being developed and used, etc. In particular, superclass constraints are not yet implemented in GHC (i.e. can't say something like "class (SetLike l, Value l ~ (k, v)) => MapLike l where ..." Unlike the choice of which gets used in FGL, I would like to think that this kind of set of classes would be more widely used in the community and will use whichever type of extension seems to be consensus as the better one (at least for this situation). If anyone knows a good site to be able to make an actual poll for this, that would probably be even better (or should we just have a wiki page where people put their names under whichever they think is better and then just tally up the number of names for each?). -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Thu, Jun 24, 2010 at 3:08 AM, Ivan Miljenovic
As an aside, Alex Mason and I are discussing the possibility of taking advantage of AusHack *shameless plug* to write some kind of classes for the different types of containers with a hierarchy. I know about ListLike, but there doesn't seem to be any applicable classes for generic containers (i.e. the abstract API of a Set; something like ListLike would then be an extension on it) and for lookup data structures (Map k a, [(k, a)], etc.).
Be sure to look into Okasaki's work on Edison. It has classes for
sequences (list-like structures) and collections (sets, heaps) and
associations (maps, priority queues) and a paper discussing the design
decisions.
--
Dave Menendez

On 25 June 2010 14:41, David Menendez
On Thu, Jun 24, 2010 at 3:08 AM, Ivan Miljenovic
wrote: As an aside, Alex Mason and I are discussing the possibility of taking advantage of AusHack *shameless plug* to write some kind of classes for the different types of containers with a hierarchy. I know about ListLike, but there doesn't seem to be any applicable classes for generic containers (i.e. the abstract API of a Set; something like ListLike would then be an extension on it) and for lookup data structures (Map k a, [(k, a)], etc.).
Be sure to look into Okasaki's work on Edison. It has classes for sequences (list-like structures) and collections (sets, heaps) and associations (maps, priority queues) and a paper discussing the design decisions.
Yeah, we will be. The reason this came up: Thomas Berekneyi wanted to use such classes for the rewrite of FGL, and when he discussed it on #haskell people generally indicated that edison was the best out there but was a bit long in the tooth and something like it should be re-written (though no-one seemed to volunteer... hmmm... :p). For example: it's a little weird that edison re-exports Data.Set and uses it for the instance with a type alias (same with map, Seq, etc.) rather than just using Data.Set itself. I also find the structuralInvariant and instanceName fields to be a little odd, whereas strict can now be shifted off to DeepSeq (or whatever the class is called). -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On 25 June 2010 05:58, Ivan Miljenovic
The reason this came up: Thomas Berekneyi wanted to use such classes for the rewrite of FGL, and when he discussed it on #haskell people generally indicated that edison was the best out there but was a bit long in the tooth and something like it should be re-written (though no-one seemed to volunteer... hmmm... :p).
Hi Ivan Note that if "new-FGL" depends on new packages it might not be acceptable for the Platform; see the thread that this message kicks off: http://www.haskell.org/pipermail/libraries/2009-August/012378.html Best wishes Stephen

Stephen Tetley
On 25 June 2010 05:58, Ivan Miljenovic
wrote: The reason this came up: Thomas Berekneyi wanted to use such classes for the rewrite of FGL, and when he discussed it on #haskell people generally indicated that edison was the best out there but was a bit long in the tooth and something like it should be re-written (though no-one seemed to volunteer... hmmm... :p).
Hi Ivan
Note that if "new-FGL" depends on new packages it might not be acceptable for the Platform; see the thread that this message kicks off:
http://www.haskell.org/pipermail/libraries/2009-August/012378.html
That's been brought up already, and yes I know. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Fri, Jun 25, 2010 at 12:58 AM, Ivan Miljenovic
On 25 June 2010 14:41, David Menendez
wrote: On Thu, Jun 24, 2010 at 3:08 AM, Ivan Miljenovic
wrote: As an aside, Alex Mason and I are discussing the possibility of taking advantage of AusHack *shameless plug* to write some kind of classes for the different types of containers with a hierarchy. I know about ListLike, but there doesn't seem to be any applicable classes for generic containers (i.e. the abstract API of a Set; something like ListLike would then be an extension on it) and for lookup data structures (Map k a, [(k, a)], etc.).
Be sure to look into Okasaki's work on Edison. It has classes for sequences (list-like structures) and collections (sets, heaps) and associations (maps, priority queues) and a paper discussing the design decisions.
Yeah, we will be.
The reason this came up: Thomas Berekneyi wanted to use such classes for the rewrite of FGL, and when he discussed it on #haskell people generally indicated that edison was the best out there but was a bit long in the tooth and something like it should be re-written (though no-one seemed to volunteer... hmmm... :p).
Edison could use some re-thinking; the state of the art in Haskell has advanced, and there are new classes like Data.Foldable and Data.Traversable to consider. In my mind, the big question is whether Sequence and Assoc should be constructor classes or use type families/fundeps. I lean towards the former, but it means that things like ByteSting can't be instances of Sequence.
For example: it's a little weird that edison re-exports Data.Set and uses it for the instance with a type alias (same with map, Seq, etc.) rather than just using Data.Set itself.
I believe that's so the implementations export a common interface. If you're in a situation where you want to use a specific implementation, Edison is designed so that you can import just the implementation module and avoid the overhead of the class system while still making it easy to switch implementations in the future.
I also find the structuralInvariant and instanceName fields to be a little odd,
I believe structuralInvariant is there for testing.
I'm not sure what instanceName is for. It's possible Edison predates
Data.Typeable, in which case instanceName might be useful for similar
purposes.
--
Dave Menendez

On Wed, Jun 23, 2010 at 11:57:54PM -0700, Don Stewart wrote:
Some people might be quite excited by Milan's work on significant performance improvements to the containers package...
Ah. This all looks excellent! Currently jhc spends about 30-40% of its time in the containers package according to profiling so improvements there will notably speed up compilation. Though, I already modifed a lot to use IntSets instead of Sets and independently discovered the HashSet/HashMap trick, but the gain should still be signifigant. The focus on tree-union is particularly cool, as I know a lot of that goes on inside the compiler. John -- John Meacham - ⑆repetae.net⑆john⑈ - http://notanumber.net/

Don Stewart
Some people might be quite excited by Milan's work on significant performance improvements to the containers package...
Yes, this is great news - both a well written article and an important piece of work on a cornerstone of the Haskell libraries. But I am also somewhat disturbed that, all this time I've been using Data.Set/Map, AVL has apparently been a much faster alternative, and I never got around to trying it out. Is there a downside to using AVL compared to Data.Set/Map? And would similar improvements apply to it? When performance is starting to matter, it is often because I have large Sets or Maps - it would also be interesting to compare the memory requirement of these data structures. (No matter how much faster a HashMap is, if it exceeds physical RAM it will be outperformed by any structure that manages to fit). -k -- If I haven't seen further, it is by standing in the footprints of giants
participants (7)
-
David Menendez
-
Don Stewart
-
Ivan Lazar Miljenovic
-
Ivan Miljenovic
-
John Meacham
-
Ketil Malde
-
Stephen Tetley