Data.Generics with GPS (using Maps to avoid getting lost in Data)

summary: speed up Syb with Uniplate-inspired techniques == Performance issues in Syb == As it stands, classic Syb, while well-supported in GHC isn't exactly the fastest generic programming option. In fact, it routinely seems to come out last in performance comparisons, sometimes by a substantial factor. That isn't always a problem in practice, because the gains in expressiveness/conciseness/maintainability outweigh the loss in performance, because traversal performance is not the bottleneck, or because practical use often involves hand-tuned traversal schemes instead of the example schemes typically used in benchmarks. Be that as it may, performance is a consideration, negative results have been published, and it seems that performance of experimental code has led to Syb being abandoned in at least one project. While it would be unrealistic to expect Syb traversals to compete with hand-written ones with current compilers, there are several obvious areas where Syb traversal performance could be subjected to improvements: (a) traversing irrelevant parts of the structure (because everything is treated generically) (b) combining results in simple, but inefficient ways ((++) nesting with the structure) (c) repeated runtime type checks to determine which function in a generically extended transformation/query to apply While runtime type checks are inherent in SYB (and alternatives have been proposed that claim to avoid them), all of a/b/c can be addressed to some extent in defining tuned traversals using the basic library. There tends to be a trade-off for a vs c, as avoiding senseless generic traversals of specific substructures implies additional runtime type checks to identify those substructures == Avoiding irrelevant traversals with substructure type maps == This message is concerned with addressing (a), in a generic way that does not require hand-tuning to the types in question, and without adding to the burden of (c). The basic idea comes from the Uniplate library, particularly the version implemented on top of Data (Data.Generics.PlateData). The idea is to compute the set of substructure types for a type under traversal and was described in last year's Uniplate paper ([1], or chapter 3 of Neil's recent thesis [2], section 3.6.2, "Optimising PlateData"), though its generality is somewhat obscured by the claim that: The next optimisation relies on the extra information present in the Uniplate operations - namely the target type. A boilerplate operation walks over a data structure, looking for target values to process. In SYB, the target values may be of any type. For Uniplate the target is a single uniform type. If a value is reached which is not a container for the target type, no further exploration is required of the values children. A Data-based 'contains' operation is defined that returns the list of immediate substructure types, represented by existentially boxed 'undefined's of the appropriate types, for all constructors in a given type. This operation can then be iterated to given all substructure types. One interesting point to note is that -though Syb is data-based- the computation of substructure types is based on reflecting a type-based view back into data. For sum types, every concrete data value only presents a subset of substructure types, but Syb's reflection features make it possible to talk about all possible constructors (*). The performance benefits can be substantial - the draft thesis claims In the benchmarks we improve on SYB by between 30% and 225%, with an average of 145% faster. So it seemed worthwhile to generalise this technique for use in Syb, which turned out to be possible after all, leading to the promised Data.Generics with GPS (Generic Positioning System;-). == Data.Generics.GPS == GPS employs Maps, to avoid getting lost in Data: - for each traversed type, build a Map TypeKey TypeSet, mapping all substructure types of the given type to their substructure types - traversals are short-circuited when the domain types of their queries or transformations cannot be found in the current substructure types - domains of queries and transformations are computed on construction GPS is inspired by Uniplate's PlateData direction finder (contains and DataBox are copied from the Uniplate paper), generalised to tackle SYB's more general queries and transformations (instead of oracles telling whether to stop, follow, or find in a search for type b in type a, we build IntSets of TypeRep keys, both for the domains of traversals and for substructure types; then several short-circuiting decisions can be based on fast intersection tests with the same IntSet). Data.Generics.GPS reexports Data.Generics, modifying everything, everywhere, mkQ, extQ, mkT, extT in such a way that building and extending transformations/queries also computes and records their domains and default transformations/values: data GenericDomainQ r = GenericDomainQ { queryDomain :: TypeSet, defaultValue :: r, genericQuery :: GenericQ r } data GenericDomainT = GenericDomainT { transDomain :: TypeSet, defaultTrans :: GenericT, genericTrans :: GenericT } while traversals accept those refined transformations/queries and add substructure type maps for short-circuiting, eg: everywhere :: forall a . Data a => GenericDomainT -> a -> a everywhere = everywhereWithMap (getSubs subMap) where subMap = fromRoot (undefined::a) Map.empty everywhereWithMap :: forall a . Data a => (forall a . Data a => a -> TypeSet) -> GenericDomainT -> a -> a everywhereWithMap getSubs gdt@(GenericDomainT{transDomain=domain, defaultTrans=dt, genericTrans=t}) x | not $ IS.null $ domain `intersection` getSubs x = t (gmapT (everywhereWithMap getSubs gdt) x) | otherwise = dt x == Performance testing in Paradise == To test the performance, I used the example in which Syb performed worst in the Uniplate paper: computing the bill in the Paradise benchmark. To get easily measurable timings, I added 100000 redundant copies of the departments in genCom. Switching from Data.Generics to Data.Generics.GPS is as easy as changing imports (and possibly types): import qualified Data.Generics as DG import Data.Generics.GPS bill :: Data a => a -> Integer bill = DG.everything (+) (0 `DG.mkQ` billS) where billS (S s) = s bill' :: Data a => a -> Integer bill' = everything (+) (0 `mkQ` billS) where billS (S s) = s Performance improves drastically, as expected (close to hand-tuned versions of everything and everywhere that explicitly skip the Strings in the Paradise types), even though substructure type maps are not yet as widely shared as in PlateData (as trace-instrumented type map construction shows). Unfortunately, a serious performance problem in Uniplate (of which Neil is already aware) completely masks the PlateData optimisations, so it is not yet visible whether the gap between PlateData and Syb has disappeared completely: -- ghc --make -O2 Main.hs; GHC version 6.9.20080514 $ ./Main.exe > dump "Data.Generics increase: 5 secs" 5550200000 "Data.Generics bill: 12 secs" "Data.Generics.GPS increase: 2 secs" 5550200000 "Data.Generics.GPS bill: 3 secs" "Data.Generics.PlateData increase: 1 min, 59 secs" 5550200000 "Data.Generics.PlateData bill: 1 min, 54 secs" == General implications of the techniques used == TypeRep keys combined with IntSets or IntMaps should be of general interest to reader of this list, as they can be used to speed up other generic programming problems as well, including typecase and extensible records libraries. Two examples should make this connection obvious: - generic queries and transformations generalise record selection and update (particularly plain to see in Uniplates universeBi/transformBi) - the core of Smash is to replace Syb's runtime type checks with compile-time type checks, but Smash's static typecase is based on HList's extensible record selection, encoded in type-class programs which current compilers do not yet partially evaluate to entirely static, constant-time selection; so, while Smash conceptually replaces runtime type selection with compile-time type selection, its compile-time type selection is compiled into runtime type selection Now, think about combining conceptually nice type-class-based type-case and record operations with pragmatically efficient IntMap-based runtime representations and selection/modification operations!-) Feedback, comments, bug-reports welcome, as usual. Claus [1] http://www-users.cs.york.ac.uk/~ndm/uniplate/ [2] http://www-users.cs.york.ac.uk/~ndm/thesis/ (*) this wouldn't work so well on non-regular types, because of the potentially infinite set of substructure types

Dear Generic folk [I'm spamming libraries@haskell.org too, in case anyone interested in generics is not on generics@haskell.org.] As you know, Claus has offered a somewhat-detailed proposal for changes to the SYB library (below). But I don't think that we have an active maintainer for any of the generic-programming libraries (esp SYB) apart from Uniplate. Then there's the related question of what generic-programming technology to promote for clients of the GHC API. The obvious candidates are Claus himself, or Alexey Rodriguez, or Thomas Schilling; but perhaps there are others too? Maybe no one has stepped forward because you all think that I'm on the job! But I'm not... I'm busy with GHC itself, and would love a maintainer for SYB and associated gubbins. I fear that otherwise we may lose the benefits of Claus's homework. Simon | -----Original Message----- | From: generics-bounces@haskell.org [mailto:generics-bounces@haskell.org] On Behalf Of Claus Reinke | Sent: 21 July 2008 14:16 | To: generics@haskell.org | Subject: [Hs-Generics] Data.Generics with GPS (using Maps to avoid getting lost in Data) | | | summary: speed up Syb with Uniplate-inspired techniques | | == Performance issues in Syb == | | As it stands, classic Syb, while well-supported in GHC isn't exactly the | fastest generic programming option. In fact, it routinely seems to come | out last in performance comparisons, sometimes by a substantial factor. | | That isn't always a problem in practice, because the gains in | expressiveness/conciseness/maintainability outweigh the loss in | performance, because traversal performance is not the bottleneck, or | because practical use often involves hand-tuned traversal schemes | instead of the example schemes typically used in benchmarks. | | Be that as it may, performance is a consideration, negative results have | been published, and it seems that performance of experimental code has | led to Syb being abandoned in at least one project. While it would be | unrealistic to expect Syb traversals to compete with hand-written ones | with current compilers, there are several obvious areas where Syb | traversal performance could be subjected to improvements: | | (a) traversing irrelevant parts of the structure (because everything is | treated generically) | (b) combining results in simple, but inefficient ways ((++) nesting with | the structure) | (c) repeated runtime type checks to determine which function in a | generically extended transformation/query to apply | | While runtime type checks are inherent in SYB (and alternatives have | been proposed that claim to avoid them), all of a/b/c can be addressed | to some extent in defining tuned traversals using the basic library. | There tends to be a trade-off for a vs c, as avoiding senseless generic | traversals of specific substructures implies additional runtime type | checks to identify those substructures [...rest snipped...]

Hi, Simon Peyton-Jones wrote:
Dear Generic folk
[I'm spamming libraries@haskell.org too, in case anyone interested in generics is not on generics@haskell.org.]
As you know, Claus has offered a somewhat-detailed proposal for changes to the SYB library (below). But I don't think that we have an active maintainer for any of the generic-programming libraries (esp SYB) apart from Uniplate. Then there's the related question of what
I do not know how much maintaining is going on, but at least the HappS folks have packaged SYB and put it on Hackage: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/syb-with-class-0.... Greetings, Mads

[I'm spamming libraries@haskell.org too, in case anyone interested in generics is not on generics@haskell.org.]
Given the low number of responders on generics@, it may well be easier to continue on libraries@, cc-ing anyone on generics@ who isn't on libraries@.
As you know, Claus has offered a somewhat-detailed proposal for changes to the SYB library (below). But I don't think that we have an active maintainer for any of the generic-programming libraries (esp SYB) apart from Uniplate. Then there's the related question of what generic-programming technology to promote for clients of the GHC API.
Thanks for raising this, Simon. I've actually been holding an email summarizing several issues (not just performance of default traversal schemes) that I'd like to see adressed in Syb (holding because the Syb authors were/are away, and because my performance improvement experiments are currently stuck on a GHC optimization issue). I'll send that email separately now.
The obvious candidates are Claus himself, or Alexey Rodriguez, or Thomas Schilling; but perhaps there are others too? Maybe no one has stepped forward because you all think that I'm on the job! But I'm not... I'm busy with GHC itself, and would love a maintainer for SYB and associated gubbins. I fear that otherwise we may lose the benefits of Claus's homework.
I'm quite willing to continue pursueing the issues I've raised until I can make concrete suggestions for improving Syb, including summing up the code changes I've been adding to my various messages (certainly there should be patches to accompany proposal tickets in the library process, and I should collect all the strands of text into a single document). I have been waiting for the original Syb authors to return from their well-earned summer camps, but there should probably be a Wiki page somewhere specifically for discussing Syb-related issues and solutions (meanwhile, I've started collecting links/info related to GHC Api type traversals here, including the main Syb issues: http://hackage.haskell.org/trac/ghc/wiki/GhcApiAstTraversals , please feel free to copy stuff from there to a Syb-specific page). But I wouldn't want to take on ownership of Syb at this point, for two reasons, both motivational: (a) it helps to have someone else to "blame" when the consequences of gfoldl's type once again hurt my brain;-), (b) it is really frustrating to get so little interest in these issues, well, we haven't even managed to start a proper discussion on any of the lists I've tried, and as long as there is a Syb owner other than myself, at least I won't be talking entirely to myself!-) Claus

Dear Claus, Simon and others,
[I'm spamming libraries@haskell.org too, in case anyone interested in generics is not on generics@haskell.org.]
Given the low number of responders on generics@, it may well be easier to continue on libraries@, cc-ing anyone on generics@ who isn't on libraries@.
I know we have been rather quiet, but please keep mailing to the generics list. As a reaction to Simon's request, we have been discussing (among ourselves) whether or not to take over SYB support in Utrecht. At the moment we are working on a release of a version of EMGM, and a new library base on type families and GADTs (about which we have written a TR available via http://www.cs.uu.nl/research/techreps/UU-CS-2008-019.html) . I think the general feeling is that it would be a good thing to also release/maintain SYB, but we want to talk about this a bit more, and right now I am in Birmingham for a conference. I'll be back in Utrecht on Thursday, and we can discuss it then. Hopefully we can respond shortly after Thursday. Kind regards, Johan
As you know, Claus has offered a somewhat-detailed proposal for changes to the SYB library (below). But I don't think that we have an active maintainer for any of the generic-programming libraries (esp SYB) apart from Uniplate. Then there's the related question of what generic-programming technology to promote for clients of the GHC API.
Thanks for raising this, Simon. I've actually been holding an email summarizing several issues (not just performance of default traversal schemes) that I'd like to see adressed in Syb (holding because the Syb authors were/are away, and because my performance improvement experiments are currently stuck on a GHC optimization issue). I'll send that email separately now.
The obvious candidates are Claus himself, or Alexey Rodriguez, or Thomas Schilling; but perhaps there are others too? Maybe no one has stepped forward because you all think that I'm on the job! But I'm not... I'm busy with GHC itself, and would love a maintainer for SYB and associated gubbins. I fear that otherwise we may lose the benefits of Claus's homework.
I'm quite willing to continue pursueing the issues I've raised until I can make concrete suggestions for improving Syb, including summing up the code changes I've been adding to my various messages (certainly there should be patches to accompany proposal tickets in the library process, and I should collect all the strands of text into a single document). I have been waiting for the original Syb authors to return from their well-earned summer camps, but there should probably be a Wiki page somewhere specifically for discussing Syb-related issues and solutions (meanwhile, I've started collecting links/info related to GHC Api type traversals here, including the main Syb issues: http://hackage.haskell.org/trac/ghc/wiki/GhcApiAstTraversals , please feel free to copy stuff from there to a Syb-specific page).
But I wouldn't want to take on ownership of Syb at this point, for two reasons, both motivational: (a) it helps to have someone else to "blame" when the consequences of gfoldl's type once again hurt my brain;-), (b) it is really frustrating to get so little interest in these issues, well, we haven't even managed to start a proper discussion on any of the lists I've tried, and as long as there is a Syb owner other than myself, at least I won't be talking entirely to myself!-)
Claus
_______________________________________________ Generics mailing list Generics@haskell.org http://www.haskell.org/mailman/listinfo/generics

Hi
As you know, Claus has offered a somewhat-detailed proposal for changes to the SYB library (below). But I don't think that we have an active maintainer for any of the generic-programming libraries (esp SYB) apart from Uniplate. Then there's the related question of what generic-programming technology to promote for clients of the GHC API.
Thanks for raising this, Simon. I've actually been holding an email summarizing several issues (not just performance of default traversal schemes) that I'd like to see adressed in Syb (holding because the Syb authors were/are away, and because my performance improvement experiments are currently stuck on a GHC optimization issue). I'll send that email separately now.
I think SYB would best be maintained by someone who does not already maintain some kind of boilerplate removal library. There are lots of experiments into GADT's and other mechanisms, but SYB1+2 is a very useful design point - and one that should be preserved. I think the maintainer should do three things in addition to general release management: 1) Speed improvements, if possible 2) API tweaks, maybe a few extra functions (universe equivalent would be nice) 3) Make it work with Hugs - I've always been surprised that SYB doesn't work with Hugs, and I don't think its that much work. As a result of these points, I think Claus is probably the perfect person to take over as maintainer.
(a) it helps to have someone else to "blame" when the consequences of gfoldl's type once again hurt my brain;-),
You can still blame those that went before - I don't think an SYB maintainer should be changing the type of gfoldl - its too fundamental.
(b) it is really frustrating to get so little interest in these issues, well, we haven't even managed to start a proper discussion on any of the lists
I am interested. I have starred your emails, and will respond in the next few days. I've been in a tent without electricity for the last few days! Thanks Neil

On Mon, Jul 28, 2008 at 11:50 PM, Neil Mitchell
Hi
As you know, Claus has offered a somewhat-detailed proposal for changes to the SYB library (below). But I don't think that we have an active maintainer for any of the generic-programming libraries (esp SYB) apart from Uniplate. Then there's the related question of what generic-programming technology to promote for clients of the GHC API.
Thanks for raising this, Simon. I've actually been holding an email summarizing several issues (not just performance of default traversal schemes) that I'd like to see adressed in Syb (holding because the Syb authors were/are away, and because my performance improvement experiments are currently stuck on a GHC optimization issue). I'll send that email separately now.
I think SYB would best be maintained by someone who does not already maintain some kind of boilerplate removal library. There are lots of experiments into GADT's and other mechanisms, but SYB1+2 is a very useful design point - and one that should be preserved.
I agree that SYB1+2 is a useful design point and that improvements should not affect the basic functionality and design. A library with a different gfoldl would very likely change SYB significantly. However, I don't think that there is a problem with a person maintaining several generic programming libraries, as long as the maintainer effects changes/improvements on SYB conservatively. I think the
maintainer should do three things in addition to general release management:
1) Speed improvements, if possible
2) API tweaks, maybe a few extra functions (universe equivalent would be nice)
3) Make it work with Hugs - I've always been surprised that SYB doesn't work with Hugs, and I don't think its that much work.
As a result of these points, I think Claus is probably the perfect person to take over as maintainer.
(a) it helps to have someone else to "blame" when the consequences of gfoldl's type once again hurt my brain;-),
You can still blame those that went before - I don't think an SYB maintainer should be changing the type of gfoldl - its too fundamental.
(b) it is really frustrating to get so little interest in these issues, well, we haven't even managed to start a proper discussion on any of the lists
I am interested. I have starred your emails, and will respond in the next few days. I've been in a tent without electricity for the last few days!
I would be fine with Claus as maintainer. As Johan mentioned, there are also people in Utrecht who are interested in contributing with development or even willing to take over maintainership if necessary. Probably it is best to let Johan&co say more on Thursday :). On a personal note, I prefer to not volunteer as a maintainer of SYB since I have other personal priorities at the moment (focus on writing thesis, searching for jobs), which also explains my slow reactions to emails. Cheers, Alexey

Hi Neil, personally, I think that historical preservation of a reference approach is quite a different issue (hosting a copy of the Syb page at haskell.org, and a copy of the library on hackage, would be a start), which might also need a maintainer at some point, but Syb1+2 is part of base right now, and available from Ralf's old pages. So far, it does seem as if most of the items I'm interested in can be done on top of Data.Generics, simply bypassing the higher-level API and providing an alternative, but very close, higher-level API on top of the same low-level API (though I sometimes wonder whether gfoldl's second parameter should be generic rather than polymorphic). But what it comes down to is that I'd like to start with a working, well supported approach (Syb 1+2), and see whether we can close any of the known gaps without starting from scratch with yet-another- generics-library. And if that means morphing what is there into something even more useful, by taking inspirations from Syb 4 or Uniplate or .., I'm all for it, as long as it is a continuous evolution supported by evidence, not a heart-liver-and-lung transplant supported by hope.
1) Speed improvements, if possible
What I'm working on are mostly more convenient access to better performance in the higher-level API (traversal schemes), reducing the need for hand-tuned traversals using the low-level API directly. The part inspired by Uniplate's PlateData seems to be working, the part about replacing nested typecases with Map lookup is currently burried in other effects.
2) API tweaks, maybe a few extra functions (universe equivalent would be nice)
It seems we've got fmap and traverse defineable in terms of Data/ Typeable, so one could derive the latter two, then get the former for free, so to speak. But should these be in some Data.Generics.Utils, or should they move into Data.Traversable, etc (which already has some default functions for defining instances of one class in terms of another)? And if you mean {-# LANGUAGE ScopedTypeVariables #-} {-# LANGUAGE RankNTypes #-} import Data.Generics universe :: forall a . Data a => a -> [a] universe = everything (++) ([] `mkQ` child) where child = return :: a -> [a] then we're probably just talking about better performance from naive definitions (1) again, or is anything else wrong with that definition?
3) Make it work with Hugs - I've always been surprised that SYB doesn't work with Hugs, and I don't think its that much work.
Hmm, I'm still fond of Hugs, but I haven't used it much recently, so I'd be the wrong person for that job. On casual glance, I can't even think of any Syb-essential language features that aren't supported in Hugs (apart from deriving Data/Typeable), and my old WinHugs doesn't seem to have/support a cpp, which gets in the way of just loading the code - why doesn't Syb work with Hugs? 4) Improve useability Things like 'typeOf1' not working with 'Data a => a', or 'gzipWithT' giving fun type errors unless we eta-expand its first parameter, Syb as a general test-bed for the quality of type error messages, etc. Here, I'm not thinking of immediate cure-alls, but of collecting the various issues, creating tickets and/or a Wiki page and looking for ways out, step by step. One item I haven't mentioned yet: can't we replace the gensym-based TypeRepKeys with something more systematic/standardised? I've wondered about this on previous occasions, to make TypeRepKeys more portable, but it would also be nice just to get rid of that IO tag (reminding us that the keys may change with each program run). Thanks for your confidence, but I'll probably just collect feedback here, contribute my code/docs when I've got everything together (would a separate syb-utils package be preferred, or direct changes to base?) and move on. I look forward to your comments, though, when you get out of that tent!-) Claus
participants (6)
-
Alexey Rodriguez
-
Claus Reinke
-
Johan Jeuring
-
Mads Lindstrøm
-
Neil Mitchell
-
Simon Peyton-Jones