Libraries and hierarchies

What follows is a proposal from Simon P.J. and myself, for solving some of the problems that have arisen with hierarchical modules in Haskell. We think this is quite a nice solution: it decentralises the allocation of names, allows versioning of libraries and referencing libraries by GUID, allows relative module names, and allows moving modules within the hierarchy without changing the source code, all without adding any extra syntax :-) Please tell us what you think. This will only get adopted if everyone agrees, because it needs all the Haskell implementations on board in order to work. THE PROBLEM ----------- Problem 1: Allocating names in the hierarchy At the moment, we have a scheme of central registration (albeit informally on this list), along with a way for users to name libraries based on their email address (eg. User.Com.Microsoft.Simonmar.Foo). This is unsatisfactory, because (a) having a central registry for names is too cathedralish, and (b) it's inconvenient to use the email-address form because it gives rise to overly long module names. Problem 2: Moving a module tree around Suppose you have a tree of modules Control.Monad, Control.Monad.X, Control.Monad.Y etc, and you want to move them from Control to some other place Foo.Baz in the hierarchy, to give Foo.Baz.Monad, Foo.Baz.Monad.X, etc. At the moment you have to visit every module and change its module header to give the correct absolute path name. Problem 3: Long module names in imports It's plain tiresome to have to write import User.Simon.Text.PrettyPrint.HughesPJ Long path names, repeated all over the source tree, are painful. They are particularly painful when you want to refer to another module in the same library -- then, if you decide to put the library somewhere else, you have to change all its internal imports. A POSSIBLE SOLUTION ~~~~~~~~~~~~~~~~~~~ The key idea is this: there is no longer a single global hierarchy of modules, but every site and every user has the means to populate their own module hierarchy as they see fit, with off-the-shelf and local libraries. This means that when you install a package (a sub-hierarchy of modules), you get to choose where in the global hierarchy on your system it is rooted. There would probably be a default, which you would most often go along with unless it clashes with another library on your system, in which case you might choose to site it somewhere else. You can even choose to site it in several places in the tree. If you want several versions of a library installed on your system, you can do that too, provided you site them at different places in the hierarchy. If you install a new version of a library, just re-site the old one to a version-specific place, and install the new one. eg. I install the GTK+HS library on my system. By default, it sites itself under Graphics.UI.Gtk.* Graphics.UI.Gtk.V0-15.* and possibly additionally sites itself under a GUID-based root, or one based on an API hash: GUID_XXXXXXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.* When I install a new version of GTK+HS, say version 0.16, I can replace the existing Graphics.UI.Gtk with the new version, and I now have: Graphics.UI.Gtk.* -- now refers to version 0.16 Graphics.UI.Gtk.V0-15.* Graphics.UI.Gtk.V0-16.* and two distinct GUID sites. You could even use this mechanism, in a per-user configuration file say, to site the GTK+HS library directly in Gtk.*, if you're too lazy to type Graphics.UI all the time. This wouldn't be recommended though: any source code you write won't compile on someone else's system that doesn't have the same convention. PACKAGES and DISTRIBUTION ~~~~~~~~~~~~~~~~~~~~~~~~~ A "package" is the unit of distribution. A package includes a sub-hierarchy of Haskell modules, as well as perhaps other stuff (such as C header files etc). When installing a package one specifies one or more "sites" at which the modules are to be grafted into the module hierarchy. A site is just the module prefix to be used for modules in that package. The sub-hierarcy within a package is not changed by re-siting. A package comes with a list of "default sites", where it will be installed by default. It is an install-time error to graft in a package at a site that means that a single module name is defined twice. The actual representation of a package in distributable form may vary between implementations. For example, Hugs may need only source files, while GHC may distribute interface files and binaries. All that matters is that for any particular implementation (Hugs, say) there's a specified way to take a package and install it at one or more sites in that Hugs-compiler's module hierarchy. For example, GHC's package configuration for an installed package currently looks something like this: { name = "mylib", import_dirs = ["/usr/local/lib/mylib/imports"], ... } and we suggest adding an extra field, sites: { name = "mylib", import_dirs = ["/usr/local/lib/mylib/imports"], sites = [ "Foo.Bar.MyLib", "Foo.Bar.MyLib.V2.3", "GUID_XXXX" ] ... } SOURCE CODE ~~~~~~~~~~~ What about module names in the source code, and how are modules compiled? Suppose I am compiling a package whose default site is Foo.Bar, and containing modules Foo.Bar.A.B, and Foo.Bar.A.C (assuming it is installed at the default site). I put the source code in A/B.hs and A/C.hs, and the code would look like this: module A.B where import A.C The implementation must obey the following rule: When compiling a module belonging to a package, that package is temporarily grafted into the root of the module hierarchy. This means that 'import A.C' will find the module A.C from the package being compiled. If there is already a global module A.C, the package module "wins"; so the global module A.C is inaccessible. (There could be some extra mechanism to get around this, if it seems important.) Modules in other packages can be imported only by uttering their full path names in the global hierarchy (of the compiler that is compiling the package). After installing the library, the tree of modules it contains will be grafted into the global hierarchy at possibly many places, and the modules can then only be imported by uttering their full path names in the global hierarchy. Alternative design: modules in the current package could be specified explicitly, perhaps by prefixing them with '.'. This would avoid the possibility of overlap between the current package and the global hierarchy, at the expense of having to add lots of extra '.'s. IMPLEMENTATION ~~~~~~~~~~~~~~ How do we implement this? For Hugs, it should be relatively straightforward; two implementations spring to mind. Either (a) transform the source files as they are installed, to replace package-relative module names with absolute names. Multiply-sited packages are implemented by copying and transforming the source code into several places. (b) have Hugs do the module fixup at load-time, and change the search strategy to take into account package-relative imports. Multiply-sites packages can be done with symbolic links, or straight copying of source files. For GHC and other systems with compiled libraries, it's a bit trickier. We need to make sure that each symbol in the compiled library cannot clash with any symbol in any other library. One way to do this is to include the package name in each symbol, and require that package names are unique (perhaps include a GUID in a package name).

confluence! Inspired by some posts the other day I implemented an immutable GUID naming scheme for interfaces with Haskell. I thought it would require compiler support, but it turns out a good 90% solution can be done as a separate program quite nicely and there are not too tough workarounds for the other 10%. http://repetae.net/john/computer/haskell/hsguid/ This is not really the same as your proposal, but is somewhat related and addresses many of the same issues. (and might be a bit lighter weight) actually, they would probably mix well as your proposed system could just use the same GUID pragmas hsguid uses as input. Let me know what y'all think. library developers, add a GUID to your code. it's easy and takes no resources, but could save users of your library a lot of hassle. :) John -- --------------------------------------------------------------------------- John Meacham - California Institute of Technology, Alum. - john@foo.net ---------------------------------------------------------------------------

[stuff about GUIDs]
Careful! There are two things one might tie GUIDs to.
You could compute a hash of (or associate a GUID with) the *interface*,
or the *implementation*. These are different things. Until we know
which we want, we should support both.
Why?
One might reasonably say "this program needs version X of the
interface". That is, we don't care how it's implemented, but it had
better export these functions at these types.
But one might also reasonably say "this program needs version Y of the
implementation". That is, it requests a particular *implementation* -
maybe it depends on certain bugs that are fixed in that version (or are
not fixed!). Or it's only been tested against that implementation, and
it wants to provide some certification guarantees ("This software is
DoD-certified to give appropriate results with specified inputs").
A side point is that GUIDs can be generated in two ways:
1. The traditional way: make up a 128-bit random number and insert it
into the interface or implementation (as appropriate). Use this as the
name for that thing.
2. The nifty way: have the compiler compute a hash (SHA-1 or MD5) of
the interface or implementation (as appropriate). Use this as the name
for that thing.
Option (2) has the advantage that it's one step shorter, and it's
safer: with (1), you can generate one GUID but accidentally use it on
two distinct interfaces / implementations; with (2), this is
(essentially) impossible (although it may be possible to achieve a
collision intentionally; malice is a separate issue that should be
addressed in other ways).
For (2), we need to agree what to hash. The options are basically "the
source text of the module" (or some sub-part of it for the interface
case), or "the abstract syntax tree of the module". The latter is
probably nicer, but requires some agreement between compiler writers if
it is to be valid across compilers.
For background to this discussion, see our forthcoming ICFP paper,
James J. Leifer, Gilles Peskine, Peter Sewell, Keith Wansbrough (2003).
Global Abstraction-Safe Marshalling with Hash Types
which is available at
http://www.cl.cam.ac.uk/~kw217/research/paper-abstracts.html#Leifer*03:G
lobal
Comments?
--KW 8-)
--
Keith Wansbrough

On Fri, Aug 01, 2003 at 05:00:55PM +0100, Keith Wansbrough wrote:
[stuff about GUIDs]
Careful! There are two things one might tie GUIDs to.
You could compute a hash of (or associate a GUID with) the *interface*, or the *implementation*. These are different things. Until we know which we want, we should support both. yes, which is why each library can export an arbitrary number of GUIDs. Oddly enough, I considered doing what you mentioned below (a hash of the interface). but chose not to so that the same interface can be exported under several GUIDs. this reflects the fact that there could be hidden assumptions (such as relying on bugs) in interfaces which should be reflected in the GUID. I also wanted the tool to have simple semantics relative the haskell base language. thinking of it as modules just having multiple Module declarations and export lists should make it easy to adopt.
plus, this was much easier to implement as a first stab at the problem :)
One might reasonably say "this program needs version X of the interface". That is, we don't care how it's implemented, but it had better export these functions at these types.
But one might also reasonably say "this program needs version Y of the implementation". That is, it requests a particular *implementation* - maybe it depends on certain bugs that are fixed in that version (or are not fixed!). Or it's only been tested against that implementation, and it wants to provide some certification guarantees ("This software is DoD-certified to give appropriate results with specified inputs").
The idea then is that you would hsguid -g twice to generate an implementation ID and an interface id. old implementations could be archived and their interface id deleted.
A side point is that GUIDs can be generated in two ways:
1. The traditional way: make up a 128-bit random number and insert it into the interface or implementation (as appropriate). Use this as the name for that thing.
2. The nifty way: have the compiler compute a hash (SHA-1 or MD5) of the interface or implementation (as appropriate). Use this as the name for that thing.
Yeah, that might be a good way to do it if I could have compiler support, but I would need to be able to append some 'salt' to take care of hidden assumptions not reflected in the raw interface. I actually implemented something like this for datatypes, based on XDR... hmm.. http://haskell.org/pipermail/haskell-cafe/2001-September/002215.html
Option (2) has the advantage that it's one step shorter, and it's safer: with (1), you can generate one GUID but accidentally use it on two distinct interfaces / implementations; with (2), this is (essentially) impossible (although it may be possible to achieve a collision intentionally; malice is a separate issue that should be addressed in other ways).
eh. I don't think the random collision is a real concern, at worst, hsguid spits out an error and you angrily email the authors for doing the equivalant of winning the lottery 10 times in a row :) problem resolved just by copying the offending pragma with a new id and use that one.
For (2), we need to agree what to hash. The options are basically "the source text of the module" (or some sub-part of it for the interface case), or "the abstract syntax tree of the module". The latter is probably nicer, but requires some agreement between compiler writers if it is to be valid across compilers.
what I did was implement a Hash class, which had a single function which returned a hash, the hash was constant if the type was a terminal, and its type hashed with all its childrens types if it was a node. recursive types were detected by drift and explicitly broken. although probably for this, the easiest thing would be to hash the cannonicalized export list (with types). but not for hsguid. but perhaps for some other tool which does a better job of generating hsguid compatable pragmas.
For background to this discussion, see our forthcoming ICFP paper,
James J. Leifer, Gilles Peskine, Peter Sewell, Keith Wansbrough (2003). Global Abstraction-Safe Marshalling with Hash Types
which is available at
http://www.cl.cam.ac.uk/~kw217/research/paper-abstracts.html#Leifer*03:G lobal
ooh. looks interesting. will read. I have really wanted something like this for haskell for a while. John -- --------------------------------------------------------------------------- John Meacham - California Institute of Technology, Alum. - john@foo.net ---------------------------------------------------------------------------

hello, Simon Marlow wrote:
... THE PROBLEM -----------
Problem 1: Allocating names in the hierarchy
At the moment, we have a scheme of central registration (albeit informally on this list), along with a way for users to name libraries based on their email address (eg. User.Com.Microsoft.Simonmar.Foo). This is unsatisfactory, because (a) having a central registry for names is too cathedralish, and (b) it's inconvenient to use the email-address form because it gives rise to overly long module names.
Problem 2: Moving a module tree around
Suppose you have a tree of modules Control.Monad, Control.Monad.X, Control.Monad.Y etc, and you want to move them from Control to some other place Foo.Baz in the hierarchy, to give Foo.Baz.Monad, Foo.Baz.Monad.X, etc. At the moment you have to visit every module and change its module header to give the correct absolute path name.
Problem 3: Long module names in imports
It's plain tiresome to have to write import User.Simon.Text.PrettyPrint.HughesPJ Long path names, repeated all over the source tree, are painful. They are particularly painful when you want to refer to another module in the same library -- then, if you decide to put the library somewhere else, you have to change all its internal imports. i think you explained the problems quite nicely.
A POSSIBLE SOLUTION ~~~~~~~~~~~~~~~~~~~ The key idea is this: there is no longer a single global hierarchy of modules, but every site and every user has the means to populate their own module hierarchy as they see fit, with off-the-shelf and local libraries.
This means that when you install a package (a sub-hierarchy of modules), you get to choose where in the global hierarchy on your system it is rooted. There would probably be a default, which you would most often go along with unless it clashes with another library on your system, in which case you might choose to site it somewhere else. You can even choose to site it in several places in the tree.
If you want several versions of a library installed on your system, you can do that too, provided you site them at different places in the hierarchy. If you install a new version of a library, just re-site the old one to a version-specific place, and install the new one.
eg. I install the GTK+HS library on my system. By default, it sites itself under
Graphics.UI.Gtk.* Graphics.UI.Gtk.V0-15.*
and possibly additionally sites itself under a GUID-based root, or one based on an API hash:
GUID_XXXXXXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.*
When I install a new version of GTK+HS, say version 0.16, I can replace the existing Graphics.UI.Gtk with the new version, and I now have:
Graphics.UI.Gtk.* -- now refers to version 0.16 Graphics.UI.Gtk.V0-15.* Graphics.UI.Gtk.V0-16.*
and two distinct GUID sites. i think the problem is that when you write a piece of software you don't know that in the future things may change in bad ways, so you are likely to always use "Graphics.UI.Gtk". Then later installing a new version of the library may break your program. I guess, at this point you could go and replace all the imports with others explicitly mentioning the version number. wouldn't it be more convinient, if one always refers to the library as "Graphics.UI.Gtk.*", but when compiling the program one gives the compiler a flag specifying which implementation of the package to use, e.g. something like:
ghc -package Gtk.V0-15 myprogram.hs or ghc -package Gtk.V0-16 myprogram.hs
...
SOURCE CODE ~~~~~~~~~~~ What about module names in the source code, and how are modules compiled? Suppose I am compiling a package whose default site is Foo.Bar, and containing modules Foo.Bar.A.B, and Foo.Bar.A.C (assuming it is installed at the default site). I put the source code in A/B.hs and A/C.hs, and the code would look like this:
module A.B where import A.C
The implementation must obey the following rule: When compiling a module belonging to a package, that package is temporarily grafted into the root of the module hierarchy.
This means that 'import A.C' will find the module A.C from the package being compiled. If there is already a global module A.C, the package module "wins"; so the global module A.C is inaccessible. (There could be some extra mechanism to get around this, if it seems important.)
Modules in other packages can be imported only by uttering their full path names in the global hierarchy (of the compiler that is compiling the package). i don't quite understand this part. how is a module going to import modules from another package, if it does not know where these packages are going to be installed? if i am writing a module M, which imports A.B.X.N which is module X.N from a package P that was installed at A.B, doesn't that force everyone who is trying to use my package to install P at the same place: A.B?
bye iavor -- ================================================== | Iavor S. Diatchki, Ph.D. student | | Department of Computer Science and Engineering | | School of OGI at OHSU | | http://www.cse.ogi.edu/~diatchki | ==================================================

Simon Marlow writes: : | After installing the library, the tree of modules it contains will | be grafted into the global hierarchy at possibly many places, and | the modules can then only be imported by uttering their full path | names in the global hierarchy. Hi. If that makes a module (module M where data T) appear twice in the global hierarchy (at A and B), do the types A.M.T and B.M.T become distinct from each other? Sorry if that's an obtuse question. I hope it has a short answer. Regards, Tom

"Simon Marlow"
What follows is a proposal from Simon P.J. and myself, for solving some of the problems that have arisen with hierarchical modules in Haskell.
We think this is quite a nice solution: [..]
Yes, I think, it is nice, too.
SOURCE CODE ~~~~~~~~~~~ What about module names in the source code, and how are modules compiled? Suppose I am compiling a package whose default site is Foo.Bar, and containing modules Foo.Bar.A.B, and Foo.Bar.A.C (assuming it is installed at the default site). I put the source code in A/B.hs and A/C.hs, and the code would look like this:
module A.B where import A.C
The implementation must obey the following rule: When compiling a module belonging to a package, that package is temporarily grafted into the root of the module hierarchy.
This means that 'import A.C' will find the module A.C from the package being compiled. If there is already a global module A.C, the package module "wins"; so the global module A.C is inaccessible. (There could be some extra mechanism to get around this, if it seems important.)
Modules in other packages can be imported only by uttering their full path names in the global hierarchy (of the compiler that is compiling the package).
After installing the library, the tree of modules it contains will be grafted into the global hierarchy at possibly many places, and the modules can then only be imported by uttering their full path names in the global hierarchy.
Alternative design: modules in the current package could be specified explicitly, perhaps by prefixing them with '.'. This would avoid the possibility of overlap between the current package and the global hierarchy, at the expense of having to add lots of extra '.'s.
I'd feel more comfortable with the alternative design. Relative and absolute names are different things; hence, it seems cleaner to distinguish them lexically. Moreover, when reading somebody else's code, it is helpful if it is easy to distinguish absolute from relative names (eg, to know where to look when trying to visit imported modules). Cheers, Manuel
participants (6)
-
Iavor Diatchki
-
John Meacham
-
Keith Wansbrough
-
Manuel M T Chakravarty
-
Simon Marlow
-
Tom Pledger