GHC source code improvement ideas

Hello GHC people,

I was trying my hand at some GHC hacking, and I couldn't help but notice that much of the code looks (IMHO) horrible. There are some things that look as if they haven't been touched since the Haskell 1.3 days. The most striking is the use of `thenXX` instead of do notation.

I was thinking of cleaning up some of this. In particular:

1a. Use do notation where possible instead of `thenXX`.

1b. Even better, use Applicative where possible. For example

        ds_type (HsFunTy ty1 ty2)
          = dsHsType ty1    `thenM` \ tau_ty1 ->
            dsHsType ty2    `thenM` \ tau_ty2 ->
            returnM (mkFunTy tau_ty1 tau_ty2)

    could be rewritten as

        ds_type (HsFunTy ty1 ty2)
          = mkFunTy <$> dsHsType ty1 <*> dsHsType ty2

2. Investigate the need for all these mappM functions. To quote the source: "Standard combinators, but specialised for this monad (for efficiency)". Wouldn't a SPECIALIZE pragma work just as well?

3. In general, use standard functions when they are available. Currently GHC has its own sort function and its own versions of FiniteMap/Data.Map and Text.PrettyPrint. Using standard functions/data structures will a) make GHC smaller and therefore easier to maintain, and b) make the code easier to understand for someone who knows the standard library.

4. A more radical change would be introducing hierarchical modules. This could be just a matter of renaming the directories to start with an upper case character and changing the import declarations. This gives module names like "Typecheck.TcGadt". The tc is redundant here, so perhaps it should be renamed to "Typecheck.Gadt" or "Typecheck.GADT". Perhaps even better would be "GHC.Typecheck.GADT", this way some modules can be exposed as part of the GHC api.

How do the GHC developers feel about this? Would such patches be gladly accepted? Or will they be directly forwarded to the bit bucket?

Twan
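For comparison with point 1a, here is the same clause written with do notation -- a minimal sketch, reusing the dsHsType/mkFunTy names from the example above:

    ds_type (HsFunTy ty1 ty2) = do
        tau_ty1 <- dsHsType ty1
        tau_ty2 <- dsHsType ty2
        return (mkFunTy tau_ty1 tau_ty2)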

Twan,

In general I'd be delighted for you to clean up GHC's code -- thank you! There is a slight downside, which is that because you end up touching a lot of code, you get a big patch that means you can't selectively pick patches from before and after the change -- but I'm not too bothered about that, and it's balanced by the clean-up. Let's also see what Simon and Ian (or others) have to say. Best to do one change per patch.

A very useful thing to do would be to improve the Commentary: http://hackage.haskell.org/trac/ghc/wiki/Commentary
In fiddling with GHC's codebase you will surely learn stuff that you wish you'd known to start with, and that is the Ideal Moment to capture it in writing on the Wiki. A real service to others.

| 1a. Use do notation where possible instead of `thenXX`.

Yes.

| 1b. Even better, use Applicative where possible. For example
|
|     ds_type (HsFunTy ty1 ty2)
|       = dsHsType ty1    `thenM` \ tau_ty1 ->
|         dsHsType ty2    `thenM` \ tau_ty2 ->
|         returnM (mkFunTy tau_ty1 tau_ty2)
|
| Could be rewritten as
|
|     ds_type (HsFunTy ty1 ty2)
|       = mkFunTy <$> dsHsType ty1 <*> dsHsType ty2

This works great when you have the pattern above -- but often it's a bit more complicated and then it does not work so well. So it's hard to employ the pattern consistently. Then you have to ask whether a mixture of styles is good. My gut feel is that it's better to use do-notation consistently.

| 2. Investigate the need for all these mappM functions. To quote the
| source: "Standard combinators, but specialised for this monad (for
| efficiency)". Wouldn't a SPECIALIZE pragma work just as well?

The trouble is that currently you can only specialise a function at its definition site -- and at the definition of mapM the relevant monad isn't in scope, because mapM is in a library. I have no clue whether this has any effect on performance. You could try a small experiment, by defining mappM = mapM and seeing if there was any change.

| 3. In general, use standard functions when they are available. Currently
| GHC has its own sort function and its own versions of FiniteMap/Data.Map and
| Text.PrettyPrint. Using standard functions/data structures will a) make GHC
| smaller and therefore easier to maintain, and b) will make the code easier
| to understand for someone who knows the standard library.

In general yes. But GHC is meant to be compilable with any GHC back to 6.2. So if the interface to a library changes between (say) 6.2 and 6.4, you end up with ifdefs. That can be a royal pain. Having the library code in the compiler source eliminates that problem.

The other thing is that it's possible that GHC's version of the library has more functions, or uses some non-standard feature like unboxed types. So proceed with caution here. Prettyprinting would be a strong candidate. FiniteMap too. But be careful with UniqFM -- it's a very, very heavily-used library.

| 4. A more radical change would be introducing hierarchical modules. This
| could be just a matter of renaming the directories to start with an upper
| case character and changing the import declarations. This gives module names
| like "Typecheck.TcGadt". The tc is redundant here, so perhaps it should be
| renamed to "Typecheck.Gadt" or "Typecheck.GADT". Perhaps even better would
| be "GHC.Typecheck.GADT", this way some modules can be exposed as part of the
| GHC api.

I don't think I'd object, but I'm not sure that much is gained here. We don't spend much time on module-name clashes.

Simon
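A minimal sketch of that experiment. The monad name M is a stand-in used purely for illustration (in GHC it would be the relevant compiler monad); the point is only to alias the hand-written combinator to the library one while keeping a monomorphic signature, so call sites are unchanged and the definition is still specialised here:

    -- Stand-in monad, purely for illustration.
    type M = IO

    -- The experiment: make the hand-written combinator an alias for
    -- the library mapM, keeping the specialised type signature.
    mappM :: (a -> M b) -> [a] -> M [b]
    mappM = mapM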

Simon Peyton-Jones wrote:
| 4. A more radical change would be introducing hierarchical modules. This
| could be just a matter of renaming the directories to start with an upper
| case character and changing the import declarations. This gives module names
| like "Typecheck.TcGadt". The tc is redundant here, so perhaps it should be
| renamed to "Typecheck.Gadt" or "Typecheck.GADT". Perhaps even better would
| be "GHC.Typecheck.GADT", this way some modules can be exposed as part of the
| GHC api.
I don't think I'd object, but I'm not sure that much is gained here. We don't spend much time on module-name clashes.
I vote for this, as it makes it easier to progress to standard module-chasing such as Cabal or `ghc --make`, and everyone supports hierarchical modules nowadays. (I might do it myself, if appropriate.)

Warning: calling them "GHC.*" clashes with base's use of that namespace for implementation-specific functions in compiled programs, INCLUDING ghc (I mean, GHC code sometimes says "import GHC.Exts") ... so let's please find a different name-prefix if we decide we want one.

~Isaac

On Fri, Jan 04, 2008 at 08:34:22AM +0000, Simon Peyton-Jones wrote:
| 4. A more radical change would be introducing hierarchical modules. This
| could be just a matter of renaming the directories to start with an upper
| case character and changing the import declarations. This gives module names
| like "Typecheck.TcGadt". The tc is redundant here, so perhaps it should be
| renamed to "Typecheck.Gadt" or "Typecheck.GADT". Perhaps even better would
| be "GHC.Typecheck.GADT", this way some modules can be exposed as part of the
| GHC api.
I don't think I'd object, but I'm not sure that much is gained here. We don't spend much time on module-name clashes.
One point is that GHC uses module names like Util and State, which is a bit unfriendly to people using GHC-as-a-library. Also, although "import Types.Generics" is a bit longer to type, it is also a bit easier for someone less familiar with the code to follow the imports (if their editor does not help them).

It's a pity that GHC.* is already used in base. I'm not sure what the best thing to do is in the short term.

Thanks
Ian

Ian Lynagh wrote:
On Fri, Jan 04, 2008 at 08:34:22AM +0000, Simon Peyton-Jones wrote:
| 4. A more radical change would be introducing hierarchical modules.
It's a pity that GHC.* is already used in base. I'm not sure what the best thing to do is in the short term.
How about Language.Haskell.Compiler.GHC.* ?

In the long term, Haskell needs a better module system IMHO. The problem at the moment is that having to write the full hierarchical name in each module means that you need to know in advance, before you've even started writing a program, where each module will fit into the global namespace, which makes it extraordinarily difficult to do exploratory programming or bottom-up development; and the need to write so many import directives in each module makes it extremely painful, not to mention overly hard-wired.

A good direction IMHO is the way SML structures are grouped using CM.make (SML/NJ) or MLBasis (MLton and others), so that there are no import directives at all: a separate group definition determines what is in scope, and compilation proceeds by chasing these group definition files rather than chasing modules.

Translating this into Haskell, the module syntax could be simplified and other modules could be nested and aliased within a module, so that it is easy to create different hierarchical views of the modules which are in scope to suit the particular module being written or clients of that module:

    -- Leaf.hs
    module Leaf where

        module String where
            open Data.ByteString.Char8
            type T = ByteString

        module Map = Data.Map

        module Util where
            foo :: Int -> Int
            foo i = i

    -- Other.hs
    module Other where

        bar :: Int -> Leaf.String.T
        bar = Leaf.String.pack . show . Leaf.Util.foo

Note that here there is no need for any import directives, since the modules which are in scope when Leaf.hs is compiled would be determined by whatever group Leaf.hs is part of (with respect to that particular program), which would be defined in a separate file:

    -- MyBasis.hsg
    local $Haskell/StdBase.hsg
    Leaf.hs
    Other.hs

Within the .hsg files, groups and modules are referenced by filename, and a group file is just a simple list of components that are required and/or exported by the group. In the .hsg file above, MyBasis will export the contents of Leaf.hs and Other.hs, which are compiled in an environment augmented by StdBase (which is not itself exported by MyBasis). (See CM.make and the MLBasis system for more details - in particular, for any given program a module may appear in only one group, but the same module may appear in different groups in different programs, thus facilitating easy exploratory programming and re-organization without breaking previous programs.)

This can be taken further to get an even more powerful system by using parameterized traits and explicit instantiation for modules:

    trait T a b where
        foo :: a -> b

        bar :: Int -> Int
        bar i = i + 1

    module M where
        include T Int String
        foo = show . bar

Here, the body of a module is always a trait, and the above is equivalent to:

    trait T a b = tra
        foo :: a -> b

        bar :: Int -> Int
        bar i = i + 1

    module M = new tra
        include T Int String
        foo = show . bar

which makes it more explicit that conversion of the contents to actual code (i.e. linking, allocation/identity/initialization of CAFs and foreign objects, generativity of data types etc.) happens only in the module decl. The great thing about making instantiation explicit is that traits are pure functional values and so can easily be combined with no side effects, whereas modules may in general contain mutable state, e.g. to interface with some external C library, and thus are objects-with-identity.

Thus the separation solves the whole problem of applicative vs generative functors in ML, as well as solving the problem of mutually recursive structures (functors become a redundant concept, because you can always start by including a trait and then overriding some decls with more instantiated (here used in a different sense) decls and/or traits).

Last but not least, a trait could also be used similarly to a signature, except that definitions in the trait can be considered to be default implementations. Therefore in this scenario the body of a class or instance decl is just a trait and so could be composed using other traits etc. (Each instance decl instantiates the trait given in the body, merged with the trait given in the class, leading to a possibly parameterized module.)

Anyway, these are some of the directions I am currently working on for my new language, which is a strict version of Haskell/ML but where explicit type annotations drive name resolution rather than explicit namespace annotations driving type inference. Related work for the above version of traits/mixins includes the Scala language and the approach described in "Evolving Software with Extensible Modules" by Matthias Zenger.

Regards, Brian.
--
www.metamilk.com

Brian Hulley wrote:
Ian Lynagh wrote:
On Fri, Jan 04, 2008 at 08:34:22AM +0000, Simon Peyton-Jones wrote:
| 4. A more radical change would be introducing hierarchical modules.
Sorry, I quoted what Simon PJ replied to, not what he wrote, so I should have written:
Ian Lynagh wrote:
On Fri, Jan 04, 2008, Twan van Laarhoven wrote:
4. A more radical change would be introducing hierarchical modules.
Though this is again not quite right, since Ian did not write line 3... Apologies to Simon PJ and Twan (and now also to Ian) for mis-quoting ;-)

Brian.

Brian Hulley wrote:
In the long term, Haskell needs a better module system IMHO [...]
I agree with the symptoms that were described, but I draw a different conclusion. We don't need to change the language - we need better tools (IDEs) that operate not at the textual level, but use the annotated syntax tree, including knowledge of fully qualified names and types.

E.g. Java requires you to write full package names everywhere, but this is never a problem when working with Eclipse (or any other modern IDE, I assume) since it just knows how to rename a package or move a class etc. Try moving some Haskell functions/declarations by hand (e.g. with Emacs) into a different, or a new, module. It's a nightmare until you get all the imports and qualified usages right. But it really is conceptually simple, so it should be a one-click operation.

Since it's this time of the year, here's my wish for 2008: a version of HaRe (Haskell Refactorer) that knows current GHC language extensions *and is integrated with Eclipse*. The way to go seems to be this: http://eclipsefp.sourceforge.net/cohatoe/ and I dearly hope there will be more progress on that.

Best regards, Johannes Waldmann, Leipzig.

Johannes Waldmann wrote:
Brian Hulley wrote:
In the long term, Haskell needs a better module system IMHO [...]
I agree with the symptoms that were described, but I draw a different conclusion.
We don't need to change the language - we need better tools (IDEs) that operate not at the textual level, but use the annotated syntax tree, including knowledge on fully qualified names and types.
Hi Johannes,

I agree that better tools would be a good thing, and could improve the programming experience considerably, but I still think that an improved module system design would be an orthogonal benefit.

Best regards, Brian.

Twan van Laarhoven wrote:
Hello GHC people,
I was trying my hand at some GHC hacking, and I couldn't help but notice that much of the code looks (IMHO) horrible. There are some things that look as if they haven't been touched since the Haskell 1.3 days. The most striking is the use of `thenXX` instead of do notation.
I was thinking of cleaning up some of this. In particular:
1a. Use do notation where possible instead of `thenXX`.
Yes.
1b. Even better, use Applicative where possible. For example
    ds_type (HsFunTy ty1 ty2)
      = dsHsType ty1    `thenM` \ tau_ty1 ->
        dsHsType ty2    `thenM` \ tau_ty2 ->
        returnM (mkFunTy tau_ty1 tau_ty2)
Could be rewritten as
ds_type (HsFunTy ty1 ty2) = mkFunTy <$> dsHsType ty1 <*> dsHsType ty2
No, IMO. This just adds another abstraction that a potential GHC contributor has to learn about. Including me :-) It doesn't hurt to write out the code in full sometimes, and as the GHC coding style guidelines say:

    Remember: other people have to be able to quickly understand what
    you've done, and overuse of abstractions just serves to obscure the
    really tricky stuff, and there's no shortage of that in GHC.

    http://hackage.haskell.org/trac/ghc/wiki/Commentary/CodingStyle

(I think I might be quoting myself there, in which case it doesn't really add much credibility to my argument :).

Also bear in mind that we have to be able to compile GHC with older versions of itself (currently back to 6.2.2), which explains why we're often conservative with respect to using external libraries.
2. Investigate the need for all these mappM functions. To quote the source: "Standard combinators, but specialised for this monad (for efficiency)". Wouldn't a SPECIALIZE pragma work just as well?
It does look like we could do away with some of those, yes.
3. In general, use standard functions when they are available. Currently GHC has its own sort function and its own versions of FiniteMap/Data.Map and Text.PrettyPrint. Using standard functions/data structures will a) make GHC smaller and therefore easier to maintain, and b) will make the code easier to understand for someone who knows the standard library.
In general yes. But bear in mind there's another principle at work here: we want to insulate GHC as much as possible from its environment, to reduce the chances of spurious failures or performance regressions.

I believe our version of Text.PrettyPrint has diverged from the library version - last time I looked it wasn't obvious that we could replace it.

Regarding Data.Map, I'd be interested in trying out AVL trees instead, to see if they have better performance, but then we'd have to import the code of course.
4. A more radical change would be introducing hierarchical modules. This could be just a matter of renaming the directories to start with an upper case character and changing the import declarations. This gives module names like "Typecheck.TcGadt". The tc is redundant here, so perhaps it should be renamed to "Typecheck.Gadt" or "Typecheck.GADT". Perhaps even better would be "GHC.Typecheck.GADT", this way some modules can be exposed as part of the GHC api.
Yes, I've wondered about doing this in the past. It does mean more typing, though: clearly 'import BasicTypes.Id' is more painful than just 'import Id' and doesn't add much, so IMO we'd need to put some thought into a structure that makes sense.
How do the GHC developers feel about this? Would such patches be gladly accepted? Or will they be directly forwarded to the bit bucket?
Within the constraints I've outlined above, cleanups are definitely welcome.

Another worthwhile gardening-type activity is to go around eliminating warnings. Many modules have {-# OPTIONS -w #-} at the top, waiting for someone to go in and clean up all the warnings. They do catch real bugs, so this is not a waste of time by any means.

In addition to this, things like

 - identifying and removing dead code
 - commoning up duplicate code fragments
 - improving test coverage
 - profiling and optimising hotspots

are all useful tasks.

Cheers,
Simon
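Concretely, the warning-gardening step amounts to something like the following sketch (the module name is invented, and -Wall is just one reasonable choice of warning flags):

    -- Before: blanket suppression at the top of the module
    --     {-# OPTIONS -w #-}
    -- After: turn warnings on and fix what gets reported
    {-# OPTIONS_GHC -Wall #-}
    module Example (usedThing) where    -- hypothetical module name

    usedThing :: Int
    usedThing = 42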

Simon Marlow wrote:
Regarding Data.Map, I'd be interested in trying out AVL trees instead, to see if they have better performance, but then we'd have to import the code of course.
Surely, we want the best maps possible for GHC and as a public library (and to minimize maintenance). The problem is to agree on a suitable interface. I would suggest taking (or only slightly changing) Daan's interface (the current Data.Map) and improving the underlying implementation, possibly using AVL trees (e.g. Adrian Hey's).

Christian

Christian Maeder wrote:
Simon Marlow wrote:
Regarding Data.Map, I'd be interested in trying out AVL trees instead, to see if they have better performance, but then we'd have to import the code of course.
Surely, we want the best maps possible for GHC and as a public library (and to minimize maintenance).
The problem is to agree on a suitable interface. I would suggest taking (or only slightly changing) Daan's interface (the current Data.Map) and improving the underlying implementation, possibly using AVL trees (e.g. Adrian Hey's).
The trouble is, as far as implementation is concerned, "the best maps possible" is a continually moving target I suspect, not to mention being highly dependent on key type. I certainly wouldn't say AVL-tree-based Maps are the best possible, though they do seem to give better performance (particularly for union, intersection etc.). The clone also addresses some defects in the current Data.Map (like lack of strictness control) and has some other useful stuff. But if this is going to be used at all, I would say this is only a stop-gap solution, which leads me to your second point about interfaces.

The generalised trie lib I was working on was a serious attempt to see what a real usable non-toy map API should look like. It's the GT class here..

http://code.haskell.org/collections/collections-ghc6.8/Data-Trie-General/Dat...

(Sorry for the long URL.) It's already somewhat huge and still incomplete IMO, but even in its current form it gives a uniform API for Ints, arbitrary Ord instances and Lists. It's a shame it's all completely untested :-(

What really needs to happen on this is..

1 - "Finish" and stabilise the GT class definition. There's still more that's needed, but I think the promised type families magic is needed first.
2 - Write a comprehensive test/benchmarking suite for GT instances.
3 - Provide some way to automatically generate the instances for arbitrary user-defined types.

Which is all rather a lot of work that nobody seems very interested in :-(

Regards
--
Adrian Hey
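As a rough illustration of the "one map API over several key types" idea being discussed -- this is not the actual GT class from the collections package; the class and all names below are invented, using an associated data family as the kind of type-families machinery mentioned above:

    {-# LANGUAGE TypeFamilies #-}
    import qualified Data.IntMap as IM

    -- Invented sketch, not the real GT class: one class, one API,
    -- with the map representation chosen per key type.
    class TrieKey k where
        data TMap k :: * -> *
        emptyT  :: TMap k v
        lookupT :: k -> TMap k v -> Maybe v
        insertT :: k -> v -> TMap k v -> TMap k v

    -- Example instance for Int keys, backed by Data.IntMap.
    instance TrieKey Int where
        data TMap Int v = IntTrie (IM.IntMap v)
        emptyT                  = IntTrie IM.empty
        lookupT k (IntTrie m)   = IM.lookup k m
        insertT k v (IntTrie m) = IntTrie (IM.insert k v m)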

Wolfgang Jeltsch wrote:
Am Sonntag, 6. Januar 2008 13:37 schrieb Adrian Hey:
It's the GT class here..
Short remark: Wouldn’t a longer, more descriptive identifier be better?
Like "GeeTee" maybe? or even "GeneralisedTrie"? I like short names myself. But as I have stopped work on this particular lib, it needs a new owner. Anyone who takes it on has renaming priveleges :-) Regards -- Adrian Hey

Hi,

Yhc did a few of these things, and our experiences were:
1a. Use do notation where possible instead of `thenXX`.
If it is that simple, yes. Certainly Yhc uses "super-monads" which made this a bit painful. You should probably stage this: first get it to the point where thenXX = (>>=), then only flip to do notation once the things are already monads.
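A minimal sketch of that first stage, with generic signatures shown purely for illustration (in GHC the combinators are defined for its own monads):

    -- Stage one: make the old combinators plain aliases for the Monad
    -- methods, so nothing else changes yet.
    thenM :: Monad m => m a -> (a -> m b) -> m b
    thenM = (>>=)

    returnM :: Monad m => a -> m a
    returnM = return

A later pass can then rewrite the call sites themselves into do notation.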
1b. Even better, use Applicative where possible. For example
    ds_type (HsFunTy ty1 ty2)
      = dsHsType ty1    `thenM` \ tau_ty1 ->
        dsHsType ty2    `thenM` \ tau_ty2 ->
        returnM (mkFunTy tau_ty1 tau_ty2)
Could be rewritten as
ds_type (HsFunTy ty1 ty2) = mkFunTy <$> dsHsType ty1 <*> dsHsType ty2
I still don't particularly like Applicative; it is a bit confusing to me.

If you moved to using the Uniplate library, you could then write:

    descendM dsHsType

Shorter, more concise, and it probably works for more cases than just HsFunTy - eliminating even more code. I suspect moving to Uniplate would give GHC massive benefits.
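For reference, a self-contained toy sketch of descendM from the uniplate package (module layout as in current uniplate; the Expr type and functions below are invented for illustration and are not GHC's types):

    {-# LANGUAGE DeriveDataTypeable #-}
    import Data.Data (Data, Typeable)
    import Data.Generics.Uniplate.Data (descendM)

    -- Toy expression type; descendM applies a monadic function to the
    -- immediate children of a node.
    data Expr = Lit Int | Neg Expr | Add Expr Expr
        deriving (Data, Typeable, Show)

    traceChild :: Expr -> IO Expr
    traceChild e = print e >> return e

    -- Prints each immediate child of the root and rebuilds the node.
    step :: Expr -> IO Expr
    step = descendM traceChild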
4. A more radical change would be introducing hierarchical modules. This could be just a matter of renaming the directories to start with an upper case character and changing the import declarations. This gives module names like "Typecheck.TcGadt". The tc is redundant here, so perhaps it should be renamed to "Typecheck.Gadt" or "Typecheck.GADT". Perhaps even better would be "GHC.Typecheck.GADT", this way some modules can be exposed as part of the GHC api.
We did this in Yhc, and it was really useful just to group files into the same directory.

Thanks
Neil
participants (11)

- Adrian Hey
- Brian Hulley
- Christian Maeder
- Ian Lynagh
- Isaac Dupree
- Johannes Waldmann
- Neil Mitchell
- Simon Marlow
- Simon Peyton-Jones
- Twan van Laarhoven
- Wolfgang Jeltsch