Data.* collections maintenance

Hello,

As some demand seems to exist for the job, I volunteer to take over/resume maintenance of Data.Map and friends, in the hope that this will alleviate the heavy load on the compiler maintainers :)

Pending issues include:
- Better integration of Map and Set (proposed by John Meacham)
- More performance tests and QuickCheck properties (Christian Maeder)
- Monoid instances (is this still current?)

Surely I have missed something, so please state your unaddressed bug reports and wish-list items.

Cheers, JP.

Jean-Philippe Bernardy wrote:
As some demand seems to exist for the job, I volunteer to take over/resume maintenance of Data.Map and friends, in the hope that this will alleviate the heavy load on the compiler maintainers :)
Thank you!
Pending issues include:
- Better integration of Map and Set (proposed by John Meacham)
- More performance tests and QuickCheck properties (Christian Maeder)
With respect to biasing, I'd like the following property between Sets and Maps:

    data Pair a b = a := b

    instance Eq/Ord a => Eq/Ord (Pair a b) where
      (a1 := _) <= (a2 := _) = a1 <= a2

and then let "Set (Pair a b)" correspond to "Map a b" for insert, union, intersection, etc.

Christian
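
The Eq/Ord line above is shorthand for two instances. Spelled out, a minimal sketch of Christian's Pair type might look like this (the expansion is mine; only the one-liner above is from the thread):

    -- Christian's Pair proposal with the Eq/Ord shorthand expanded.
    -- Both instances deliberately ignore the second component, so a
    -- Set of Pairs keyed on the first component behaves like a Map.
    data Pair a b = a := b

    instance Eq a => Eq (Pair a b) where
      (a1 := _) == (a2 := _) = a1 == a2

    instance Ord a => Ord (Pair a b) where
      compare (a1 := _) (a2 := _) = compare a1 a2

With these instances, insert/union/intersection on "Set (Pair a b)" compare only the keys, so which value survives a collision is exactly the left/right biasing question discussed below.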

Hello folks,

On Friday 21 Oct 2005 8:37 am, Jean-Philippe Bernardy wrote:
As some demand seems to exist for the job, I volunteer to take over/resume maintenance of Data.Map and friends, in the hope that this will alleviate the heavy load on the compiler maintainers :)
Well, not that I'm in any way biased :-), but might I suggest that if these libraries are going to get an API overhaul, you could probably save yourself quite a bit of work (and get a performance boost too) if you switched from using size-balanced trees (aka Adams trees?) to using Data.Tree.AVL.

AFAICT all the problems people seem to have mentioned recently with the current Data.Set/Map APIs are already solved in Data.Tree.AVL, or are trivially solvable: e.g. you can choose whether you want left or right biasing (AVL is uncommitted), intersection works on trees of different element types, and strictness in elements is much more finely tuneable.

I was about to offer half-finished AVL-based clones of Data.Set/Map I have written (but not published), but having just checked, it seems I must have deleted them at some point :-( I can remember the implementation was quite easy, though.

The only reservation I have about this is that the size function and indexing operations would be O(n) for AVL, not O(1) or O(log n). But the only reason size is O(1) for size-balanced trees is that you pay extra for this information everywhere else, and I'm not sure that index functions should be supported at all in an abstracted Map API (they seem a bit unsafe to me). Does anybody use Data.Map indexing? And why doesn't Data.Set support this too? (Just curious :-)

Also, more generally, it seems impossible to have a single optimal Set/Map implementation for all possible element or key types. IntSet/Map will perform better for Ints, StringMap will perform better for Strings (the latter could be generalised to ListMap, I guess, or maybe even a SequenceMap). Where would these fit in the general scheme of things?

Regards
--
Adrian Hey
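
To make the biasing point concrete: an implementation that is uncommitted about which of two equal elements survives can expose the choice through a combining function. Here is an illustrative sketch on a plain (unbalanced) BST; the names are hypothetical, and this is not the Data.Tree.AVL API:

    -- An insert that takes a combining function, so the caller decides
    -- the biasing.  'old' is the element already present, 'new' the one
    -- being inserted.  (Unbalanced tree, for illustration only.)
    data BST a = Tip | Node (BST a) a (BST a)

    insertWithBias :: Ord a => (a -> a -> a) -> a -> BST a -> BST a
    insertWithBias _       new Tip = Node Tip new Tip
    insertWithBias combine new (Node l old r) =
      case compare new old of
        LT -> Node (insertWithBias combine new l) old r
        GT -> Node l old (insertWithBias combine new r)
        EQ -> Node l (combine old new) r    -- the caller's choice

    leftBiased, rightBiased :: Ord a => a -> BST a -> BST a
    leftBiased  = insertWithBias (\old _ -> old)    -- keep the existing element
    rightBiased = insertWithBias (\_ new -> new)    -- keep the new element

Biasing matters because elements that are equal under Ord can still differ observably, as with the Pair type earlier in the thread.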

Hi Adrian,

I've always wanted to try out your AVL implementation, but never found the time to adapt your interface to that of Data.Set and Data.Map (or a sub-interface). If you have done so, that would only mean replacing import entries (and would allow switching back if it doesn't work as expected). (It would also be handy if your modules could be passed to the same properties and performance checker, if we had those.)

A different performance profile is of course fine (and desirable if there is a big gain in certain, or even most, cases).

So far, the problem has always been to agree on an accepted interface (which I didn't want to be a class, but rather leave to the discipline of library writers). How about something like Data.Set.Tree.AVL (and Data.Set.Seq.Ordered), or Data.SetByAVLTree (and Data.SetByOrderedSeq)?

Cheers,
Christian

Christian Maeder wrote:
So far, the problem has always been to agree on an accepted interface (which I didn't want to be a class, but rather leave to the discipline of library writers).
What exactly prohibits the use of a class here? What about the class design of Edison (I admit I never used it)? http://www.haskell.org/ghc/docs/edison/users004.html#toc3

IMHO the use of interfaces (instead of relying on the discipline of library writers and users) is the important step forward from the collections framework of STL (C++) to that of Java. We certainly don't want to fall behind that, or if we do, we need to have very good reasons.

--
Johannes Waldmann -- Tel/Fax (0341) 3076 6479/80
http://www.imn.htwk-leipzig.de/~waldmann/

First of all, thank you to everyone who has volunteered to help maintain the Data.Set/Map/etc. code. I believe that standard and *stable* libraries for basic data structures are very important to the community, and I encourage any improvements or changes to the code.

I agree that the whole show/read part of the library is not well thought out -- I guess there should be the standard read/show instances, together with some pretty show functions for debugging/education.

However, keep in mind that the goal and design of the Data.Set/Map/etc. structures was to be a stable and complete concrete design of data structures with documented worst-case time complexity. It was not designed as a general data structure interface that could accommodate many different implementations.
What exactly prohibits the use of a class here? What about the class design of Edison (I admit I never used it)? http://www.haskell.org/ghc/docs/edison/users004.html#toc3
That is why I think we should not use classes or anything fancy here. The design of a proper general data structure interface is desirable and a very interesting challenge, but it belongs in a different module/library (Data.Generic.Set, Data.Edison, ?) that perhaps uses Data.Set/Map/etc., or Adrian's AVL modules, to provide implementations.

All the best,
-- Daan.

ps. Just to be clear, the code does not belong to me any more; the libraries are free to use and modify it, and I wouldn't mind at all if someone wants to take a different direction with it (especially since I cannot contribute myself anyway :-). In general, I think it should be easier for everyone to contribute code to the library base. This was one of the discussion topics at the Haskell workshop, and I think this is a good case where it seems to be difficult to contribute easily: there seemed to be many people who had their own read instances but weren't able to add this code to the standard libraries. I am all for anarchy and fun ;-) and I think the libraries would be better because of it. Maybe we should think about how patches could be easily contributed -- should we move over to darcs for the standard library code? Together with clear instructions on how to send in patches? Maybe even allow everyone to just commit...

Daan Leijen wrote:
Maybe we should think about how patches could be easily contributed -- should we move over to darcs for the standard library code? Together with clear instructions on how to send in patches? Maybe even allow everyone to just commit...
I have some experience (as the person whose butt was on the line for the results) with 'allow everyone to commit' on a large (commercial) library (10^6 LOC of high-level code) and ~40 developers, 15 of them full-time and the rest university researchers.

Our approach was (and still is) to have a HUGE automated test suite, and to have it properly instrumented. In other words, each test case is a pair (test, dependencies), where dependencies records *everything* needed during the run of that test. This allows two things:

1) patchers can run
       testall -opens file1.hs -opens file2.hs
   and make sure that everything they've done seems to work (locally)

2) a central repository can (automatically!)
   a) accept a patch and 'install' it on a copy of the local stable version
   b) run testall -opens file1.hs -opens file2.hs on all the patched files
   c) accept the patch into the next stable system if _everything_ was fine, and reject it otherwise (with an email to the author listing the failures encountered)
   d) additionally, a nightly run of the complete suite is also required, to catch failures due to 'new' dependencies

This level of automation is needed when you have a lot of people working all at once -- humans can't keep up. It also means that there is always a stable system available, or at least one as stable as the quality of the test suite. This makes it a first-check-in system, where a person who checks in something that works in isolation but breaks the 'latest' stable system is the one responsible for figuring out why.

The actual system is more sophisticated, in that it also tracks resource usage (time, memory) for each test and reports on wide variations of those too. These are not currently grounds for automatic rejection, though I would make them so. I would also add an automatic check that 'new' functionality comes with new automated tests as well, else it would be rejected (automatically).

The 'dependencies' list serves two purposes:
1) it tells patchers how wide an effect their changes can potentially have, by the size of the test suite it pulls in
2) it lowers the load on the central machine, as full test suites usually take hours to run (if they are of a decent size)

The above system has been used successfully in a distributed environment for about 7 years now, and seems to be a good balance between total anarchy and central control. Once the developers are trained to always write automated tests, the system can grow very quickly without either being in a broken state or in a 'waiting for review' state all the time.

I would strongly recommend against 'allow everyone to just commit' without the presence of a large automated test suite which is used to (automatically) reject code that breaks a test.

Jacques
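
As a toy model of the gating scheme Jacques describes (the names and structure here are my invention, not his actual system):

    import Data.List (intersect)

    -- Each test carries the files it depends on; a patch is accepted
    -- only if every test that touches a patched file still passes.
    data TestCase = TestCase
      { testName  :: String
      , dependsOn :: [FilePath]
      , runTest   :: IO Bool      -- True = pass
      }

    -- The tests pulled in by a patch, per the 'dependencies' lists.
    affectedBy :: [FilePath] -> [TestCase] -> [TestCase]
    affectedBy patched = filter (not . null . intersect patched . dependsOn)

    -- The automated accept/reject decision for one patch.
    gatePatch :: [TestCase] -> [FilePath] -> IO Bool
    gatePatch suite patched = fmap and (mapM runTest (affectedBy patched suite))

The nightly full-suite run then corresponds to running the gate over every file, catching failures that the per-patch dependency lists miss.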

Jacques Carette writes:
I would strongly recommend against 'allow everyone to just commit' without the presence of a large automated test suite which is used to (automatically) reject code that breaks a test.
Darcs supports 'run test before checkin', and if we had darcs, people would contribute tests too.
--
Shae Matijs Erisson - http://www.ScannedInAvian.com/
Sockmonster once said: You could switch out the unicycles for badgers, and the game would be the same.

On Friday, 21 October 2005 22:54, Shae Matijs Erisson wrote:
Jacques Carette writes:
I would strongly recommend against 'allow everyone to just commit' without the presence of a large automated test suite which is used to (automatically) reject code that breaks a test.
Darcs supports 'run test before checkin', and if we had darcs, people would contribute tests too.
Well, I appreciate everybody's enthusiasm for darcs, but the statement above is highly misleading and *not* a valid reason at all to change to darcs: we could easily have this with CVS already (and of course with Subversion etc.), see: http://ximbiot.com/cvs/manual/cvs-1.12.13/cvs_18.html#SEC190

Our main problem here is not technical, it is simply the lack of a decent test suite for all the library packages in the repository. And I totally agree with Jacques here: as long as we don't have this test suite, I'll furiously object to 'allow everyone to just commit'. This will simply not work, and will actually turn away many more people from Haskell, due to the resulting instability and API volatility, than it attracts to our beloved language.

I see a dire need for more Haskell *maintainers*, not for more developers or brand-new shiny version control systems. By "maintainers" I mean people accepting/testing/merging patches, discussing APIs with people, kicking coders to write tests for their code, collecting opinions and writing down API proposals, keeping existing APIs stable/sane/usable, etc. This is a lot of work; it's difficult, time-consuming, and much less fun than coding, but we need those people. Darcs won't help with this at all...

Don't get me wrong: I think that darcs is a great tool and did a lot to improve Haskell's visibility and acceptance, but it is not a solution for the problems we have. People always have a tendency to propose technical solutions for non-technical problems, and this is exactly what I'm observing currently on this list...

Cheers,
S.

Sven Panne writes:
Our main problem here is not technical, it is simply the lack of a decent test suite for all the library packages in the repository. And I totally agree with Jacques here: as long as we don't have this test suite, I'll furiously object to 'allow everyone to just commit'.
I agree, and there should certainly be people in charge, maintaining a certain quality control. (Linux has a good model there, I think.)
Don't get me wrong: I think that darcs is a great tool and did a lot to improve Haskell's visibility and acceptance, but it is not a solution for the problems we have.
Darcs *does* make it easier to submit small patches, however. If I have a darcs repo, I can type "darcs send" to propagate fixes in the right direction. With CVS I would have to find the person in charge, generate diffs in the proper format, and mail them. That's not too difficult, of course, but at least for me, darcs lowers the threshold for participation.

-k
--
If I haven't seen further, it is by standing in the footprints of giants

On 10/24/05, Ketil Malde wrote:
Darcs *does* make it easier to submit small patches, however. If I have a darcs repo, I can type "darcs send" to propagate fixes in the right direction. With CVS I would have to find the person in charge, generate diffs in the proper format, and mail them. That's not too difficult, of course, but at least for me, darcs lowers the threshold for participation.
Another point is that with darcs it's easy to manage your own branch of the repo and still be up to date with main development. For example, you can make your own changes to the compiler/libraries without disturbing the GHC team at all, and still have your changes properly versioned. Later, if they think your changes are valuable, they can easily (modulo conflicts) take the patches from you.

Best regards,
Tomasz

Sven Panne writes:
Our main problem here is not technical, it is simply the lack of a decent test suite for all the library packages in the repository. And I totally agree with Jacques here: as long as we don't have this test suite, I'll furiously object to 'allow everyone to just commit'. This will simply not work, and will actually turn away many more people from Haskell, due to the resulting instability and API volatility, than it attracts to our beloved language.
While I mostly agree with what you're saying, as a point of clarity I want to mention that darcs does _not_ "allow everyone to commit", and therefore won't have the impact upon stability that you are worrying about. Darcs allows any user to commit to their own local repository, and makes it easy to merge changes between repositories and send changes upstream. There would still be a central/official repository that would be stable (well, as stable as our CVS repo is now).
I see a dire need for more Haskell *maintainers*, not for more developers or brand-new shiny version control systems. By "maintainers" I mean people accepting/testing/merging patches, discussing APIs with people, kicking coders to write tests for their code, collecting opinions and writing down API proposals, keeping existing APIs stable/sane/usable, etc. This is a lot of work; it's difficult, time-consuming, and much less fun than coding, but we need those people.
I totally agree with this.
Darcs won't help with this at all...
I'm not sure of this, but I do agree with Simon about attracting folks.

peace,
isaac

On Fri, Oct 21, 2005 at 05:06:34PM +0200, Johannes Waldmann wrote:
Christian Maeder wrote:
So far, the problem has always been to agree on an accepted interface (which I didn't want to be a class, but rather leave to the discipline of library writers).
What exactly prohibits the use of a class here? What about the class design of Edison (I admit I never used it)? http://www.haskell.org/ghc/docs/edison/users004.html#toc3
The idea is (I believe) to get concrete implementations of things first, then try out and discuss several class structures on top of them and see which one is most agreeable (there is no reason there couldn't be multiple competing ones). Since concrete implementations need to exist either way, they are something we need to work on in any case, and having them share similar interfaces will make creating an appropriate class (and using the concrete implementations) easier.

John
--
John Meacham - ⑆repetae.net⑆john⑈

On Friday, 21 October 2005 16:41, Christian Maeder wrote:
Hi Adrian,
I've always wanted to try out your AVL implementation, but never found the time to adapt your interface to that of Data.Set and Data.Map (or a sub-interface). If you have done so, that would only mean replacing import entries (and would allow switching back if it doesn't work as expected).
(It would also be handy if your modules could be passed to the same properties and performance checker, if we had those.)
I heard that there have been such tests? Who has them, and why not put them into the testsuite in the GHC source distro?
A different performance profile is of course fine (and desirable if there is a big gain in certain, or even most, cases).
So far, the problem has always been to agree on an accepted interface (which I didn't want to be a class, but rather leave to the discipline of library writers).
How about something like Data.Set.Tree.AVL (and Data.Set.Seq.Ordered), or Data.SetByAVLTree (and Data.SetByOrderedSeq)?
Why not have a class for Set, Map, ...? Then we could change the implementation in the instances. Furthermore, we could provide optimised versions like IntMap as instances. Georg
--
Georg Martius, Tel: (+49 34297) 89434
http://www.flexman.homeip.net

On Fri, Oct 21, 2005 at 05:15:54PM +0200, Georg Martius wrote:
Why not have a class for Set, Map, ...? Then we could change the implementation in the instances. Furthermore, we could provide optimised versions like IntMap as instances.
One class or two, one hierarchy or two? There are several proposals (see this list in Mar and Apr last year), but none is ideal, and they all need multi-parameter type classes.
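
For readers wondering why multi-parameter type classes come up at all: a collection class must mention both the collection type and its element type, which Haskell 98 classes cannot express. A minimal sketch of the shape such a class takes (my illustration, not any of the actual proposals):

    {-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}

    -- The class relates the collection type c to its element type e;
    -- the functional dependency c -> e says e is determined by c.
    class Ord e => SetLike c e | c -> e where
      empty  :: c
      insert :: e -> c -> c
      member :: e -> c -> Bool

    -- Instances could then cover both generic and specialised
    -- implementations, e.g. "SetLike (Set a) a" for Data.Set and
    -- "SetLike IntSet Int" for Data.IntSet.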

Hello Christian,

On Friday 21 Oct 2005 3:41 pm, Christian Maeder wrote:
(It would also be handy if your modules could be passed to the same properties and performance checker, if we had those.)
Yes, defining a common API which is usable in actual applications seems rather difficult, but even if we just agreed on a common simple API for correctness testing and benchmarking, that would be useful (these could be made much more thorough if they didn't have to be re-written for each new implementation).
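
One lightweight shape such a shared test API could take (a sketch under my own naming, not an agreed interface): bundle the operations under test in a record, and write the QuickCheck properties once against that record.

    -- Each Set implementation fills in this record once; every property
    -- is then written against the record rather than a fixed module.
    data SetOps s a = SetOps
      { opEmpty  :: s
      , opInsert :: a -> s -> s
      , opMember :: a -> s -> Bool
      }

    prop_insertMember :: SetOps s Int -> Int -> [Int] -> Bool
    prop_insertMember ops x xs = opMember ops x (opInsert ops x s)
      where s = foldr (opInsert ops) (opEmpty ops) xs

    -- Usage sketch, assuming "import qualified Data.Set as Set":
    --   quickCheck (prop_insertMember (SetOps Set.empty Set.insert Set.member))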
A different performance profile is of course fine (and desirable if there is a big gain in certain, or even most, cases).
AVL seems to gain in two ways.

1. It does a better job of balancing in pathological cases, like growing a tree from a sorted list (where the tree is not known to be sorted).

2. The Hedge algorithm seems to be a bit of a problem, IMO. The actual benchmark times for AVL union and Data.Set union aren't significantly different for sets of Ints (AVL seemed marginally faster, but nothing to get excited about). But if you count the number of comparisons actually performed, Data.Set (Hedge algorithm) is doing about 3..5 times as many comparisons. Presumably this would be reflected in measured execution times in cases where set element comparison was less trivial than comparing Ints.

Also, you'd expect the additional indirection overhead of using an AVL tree of association pairs (vs. a specialised Data.Map) to slow down lookup somewhat. But it doesn't seem to significantly, IIRC, even with Ints. Not sure why; maybe better balancing is giving shorter search paths, or maybe the use of Ord dictionary passing (vs. HOFs) is more expensive.

One problem with AVL is that the complexity of most of the code means producing specialised unboxed implementations manually is not an attractive proposition. Hopefully compilers will do this sort of thing for us one day. But in reality I expect this sort of thing to be more of an issue for benchmark-style code using Ints. For more complex keys and comparisons the indirection cost will be relatively insignificant, I think. For polymorphic implementations, counting the actual number of comparisons required by some function is just as important, IMO.
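
A crude way to do such comparison counting (my sketch, not Adrian's harness; the unsafePerformIO counter is a benchmarking hack, not production code) is to wrap the element type so that every comparison bumps a global counter:

    import Data.IORef
    import System.IO.Unsafe (unsafePerformIO)

    -- Global comparison counter; NOINLINE stops GHC duplicating the IORef.
    compCount :: IORef Int
    compCount = unsafePerformIO (newIORef 0)
    {-# NOINLINE compCount #-}

    newtype Counted a = Counted a

    instance Eq a => Eq (Counted a) where
      Counted x == Counted y =
        unsafePerformIO (modifyIORef' compCount (+1) >> return (x == y))

    instance Ord a => Ord (Counted a) where
      compare (Counted x) (Counted y) =
        unsafePerformIO (modifyIORef' compCount (+1) >> return (compare x y))

Running the same union benchmark on sets of Counted Int and reading the counter afterwards (readIORef compCount) then gives comparison totals in the spirit of the figures below.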
So far, the problem has always been to agree on an accepted interface (which I didn't want to be a class, but rather leave to the discipline of library writers).
How about something like Data.Set.Tree.AVL (and Data.Set.Seq.Ordered), or Data.SetByAVLTree (and Data.SetByOrderedSeq)?
Dunno. At one time I did have something like Data.Tree.AVL.Clone.Data.Set and Data.Tree.AVL.Clone.Data.Map, but I never actually posted the implementations. (I wish I had now, because it seems I've lost them.)

Regards
--
Adrian Hey

On Saturday 22 Oct 2005 11:57 am, Adrian Hey wrote:
2. The Hedge algorithm seems to be a bit of a problem, IMO. The actual benchmark times for AVL union and Data.Set union aren't significantly different for sets of Ints (AVL seemed marginally faster, but nothing to get excited about). But if you count the number of comparisons actually performed, Data.Set (Hedge algorithm) is doing about 3..5 times as many comparisons. Presumably this would be reflected in measured execution times in cases where set element comparison was less trivial than comparing Ints.
I've just re-run some old benchmarking code to see what the latest situation is re. this. There are 3 tests, all with random Ints:

Test1: union of two different sets, both of size 200000
Test2: union of two different sets, of size 200000 and 200
Test3: union of two different sets, of size 200 and 200000

Relative execution times (in no particular units) are:

          Data.Tree.AVL   Data.Set
Test1:    0.51            0.66
Test2:    1.07e-3         2.35e-3
Test3:    1.07e-3         2.32e-3

So it seems that if there's a big difference in set size, the difference in performance isn't so marginal after all. The above tests do not include comparison-counting overhead. If I count the number of comparisons, the same tests give:

          Data.Tree.AVL   Data.Set
Test1:    470668          1562230
Test2:    2449            11336
Test3:    2445            11238

Regards
--
Adrian Hey
participants (14): Adrian Hey, Christian Maeder, Daan Leijen, Georg Martius, Isaac Jones, Jacques Carette, Jean-Philippe Bernardy, Johannes Waldmann, John Meacham, Ketil Malde, Ross Paterson, Shae Matijs Erisson, Sven Panne, Tomasz Zielonka