
Hi all,

Malcolm Wallace wrote:
> Hello to everyone who has joined the HaskellDoc mailing list. We had a little bit of discussion before announcing the list more widely, but everything now seems to have stopped dead. So it's time to get thoughts rolling again. Do check the list archive on haskell.org to see what has already happened.
>
> I'll start by declaring my interest in automatic documentation.

Position statements seem to be the order of the day (or week, maybe). So here are some points on what I believe a good standard should look like.

1. I agree with Malcolm and Jan Skibinski that the documentation conventions need to be lightweight. (I too dislike literate programming, except possibly when the aim is to write a paper or a book.)

2. I think the documentation standards should be able to support both internal and external documentation.

3. I believe a standardized, intermediate, "raw" documentation format would be useful.

4. I think the intermediate format should be based on XML.

I will now discuss each point in turn.

-----------------------------------------------------------------------

1. I agree with Malcolm and Jan that the documentation conventions need to be lightweight. (I too dislike literate programming, except possibly when the aim is to write a paper or a book.)

However, I think that relying solely on positional cues might be too constraining and (in the long run) inflexible. So personally, I think HDoc/JavaDoc-like tags are a good compromise.

To that, I also see a need to add some lightweight conventions for markup of explanatory text. E.g. I'd like to be able to mark variable names, emphasize a piece of text, and maybe include small code fragments. Jan proposes to use conventions like 'xxx' for variable names and "zzz" for emphasis (I think). That's probably reasonable, and I indeed use the 'xxx' convention in my own comments sometimes. But one should be aware that this usage can conflict with the normal meaning of the quote characters. As alternatives, other lightweight emphasis conventions like _yyy_ or *zzz* spring to mind.

I would even find some more heavyweight conventions acceptable for marking entire paragraphs, such as <code> and </code>. This would be very useful for including usage examples in external documentation, for instance.

I would like the possibility to include pictures (as opposed to having to rely on ASCII graphics).
Take a look at the Fudgets documentation for examples showing how useful this can be.

Finally, I do agree with Malcolm that XML is far too heavy to be used at this level.

-----------------------------------------------------------------------

2. I think the documentation standards should be able to support both internal and external documentation.

By internal documentation, I mean documentation of the source code as such, intended for people who need to read and understand source code, such as developers and maintainers.

By external documentation, I mean documentation of interfaces, intended for people who need to use a piece of software but who do not need to know about the internal details. I guess this mainly applies to library interfaces, but one could also consider manpage-style application documentation (cf. POD from the Perl world).

Since the markup needs for internal and external documentation are pretty similar, I don't think it will be very difficult to develop a standard supporting both. The main thing which has to be added is a way of declaring whether a piece of documentation is for internal or external (or maybe both) use. Having two different commenting conventions (e.g. "{--" and "{---") would be a possibility. Another possibility, probably more flexible, is to have some initial tag.

For external documentation, it may also be useful to be able to generate documentation at different levels of detail. For instance, for a very large library, it might be useful to have brief beginner documentation, more extensive programmer documentation, and full documentation (e.g. including obsolete, deprecated features). Again, the Fudgets documentation is a good example (and where I picked up the idea). It would seem that a comment classification scheme based on initial tags could easily be adapted for this kind of use as well.

Once the documentation comments have been classified, generating internal or external documentation is really a tool issue.
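
To make the initial-tag idea a bit more concrete, here is a minimal sketch in Haskell of how a tool might classify comment texts by audience. The tag names (@internal, @external, @both) and the function names are assumptions made up for illustration only; nothing of the sort has been agreed on, and a real tool would work on comments extracted by a proper lexer rather than on bare strings.

```haskell
import Data.Char (isSpace)
import Data.List (stripPrefix)

-- | Intended audience of a documentation comment.
data Audience = Internal | External | Both
  deriving (Eq, Show)

-- | Hypothetical initial tags; the spellings are assumptions, not a standard.
audienceTags :: [(String, Audience)]
audienceTags =
  [ ("@internal", Internal)
  , ("@external", External)
  , ("@both",     Both)
  ]

-- | Classify the text of one comment by its initial tag, returning the
-- audience and the remaining text. Untagged comments yield Nothing.
classify :: String -> Maybe (Audience, String)
classify comment =
  case [ (aud, dropWhile isSpace rest)
       | (tag, aud) <- audienceTags
       , Just rest <- [stripPrefix tag body]
       , tagEnds rest
       ] of
    (hit : _) -> Just hit
    []        -> Nothing
  where
    body = dropWhile isSpace comment
    -- The tag must be followed by whitespace or by the end of the comment,
    -- so that "@internals" is not mistaken for "@internal".
    tagEnds rest = all isSpace (take 1 rest)

-- | Keep the comment texts relevant to one kind of documentation.
-- Comments tagged for both audiences are always kept.
select :: Audience -> [String] -> [String]
select wanted comments =
  [ text
  | Just (aud, text) <- map classify comments
  , aud == wanted || aud == Both
  ]
```

Further levels of detail (beginner, programmer, full) could presumably be handled the same way, by adding constructors and tags to the table.
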
For internal documentation, a tool would basically just have to extract type signatures (or infer them), type definitions, class definitions, etc. along with all internal documentation comments.

For proper external documentation, a good tool also has to take import and export into account. For instance, a library could be made up of a number of modules which are collected and re-exported by one single "top-level" module. The users are not supposed to have to know about the internal library structure, but only see the one module. Thus, when generating documentation for this module, the tool would have to collect documentation for the re-exported entities from _other_ modules.

------------------------------------------------------------------------

3. I believe a standardized, intermediate, "raw" documentation format would be useful.

I've argued above that it would be desirable to support at least two different types of documentation. Furthermore, documentation could conceivably be rendered in a plethora of different formats: HTML, PDF, PostScript, info, LaTeX, DocBook, etc. Different people and organizations may even have specialized formatting needs. For instance, assuming e.g. an HDoc/JavaDoc-like convention where the very first sentence of a documentation comment gives a synopsis, someone maintaining a collection of libraries (e.g. on haskell.org) might like a tool that extracts only this information for each library, so that someone browsing through the collection can quickly determine whether a particular library fits the bill or not. Or imagine an organization where all documentation has to conform to some strictly defined, internal standard. The possibilities are, if not endless, at least extensive.

Add to this other applications such as searching through a library (or collection of libraries) based on type information (an old idea which often is quite useful, but sadly neglected in today's functional programming environments).
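
As an illustration of the synopsis-extraction tool mentioned above, here is a minimal Haskell sketch. It assumes (my assumption, not a settled convention) that a sentence ends at the first '.', '!' or '?' followed by whitespace or by the end of the comment. That simple rule would cut a comment short at an abbreviation such as "e.g." followed by a space, so a real tool would need something more careful. The function names are hypothetical.

```haskell
import Data.Char (isSpace)

-- | Extract the first sentence of a documentation comment. A sentence is
-- assumed to end at the first '.', '!' or '?' that is followed by
-- whitespace or by the end of the text.
synopsis :: String -> String
synopsis = go . dropWhile isSpace
  where
    go [] = []
    go (c : rest)
      | c `elem` ".!?" && all isSpace (take 1 rest) = [c]
      | otherwise = c : go rest

-- | Build a one-line index entry per library from (name, comment) pairs,
-- collapsing line breaks inside the synopsis into single spaces.
index :: [(String, String)] -> [String]
index libs =
  [ name ++ ": " ++ unwords (words (synopsis doc)) | (name, doc) <- libs ]
```

Given the pair ("Sort", "Sorts a list. Uses merge sort."), index would produce the entry "Sort: Sorts a list.".
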
All tools carrying out tasks like those suggested above share a common need: a (preferably easy) way to extract "meta" information from source code. For example:

* Names of all exported entities (i.e. "canonical", fully expanded export list).
* Origin info for exported entities not defined locally.
* Names of all top-level entities defined in a module.
* For types and classes, their definitions.
* Type signatures for functions and method instances.
* Author-supplied documentation associated with the various top-level entities.
* Maybe source code positions, or at least the name of the file in which something is defined.
* Fixity declarations.
* Perhaps even strictness signatures.

There are different ways to get such information. In some cases, simple matching based on regular expressions might be enough. Unfortunately, such solutions tend to be fragile, in particular for a language with the lexical and syntactical conventions of Haskell (take nested comments, for one example). It is also unclear to what extent such solutions could be shared between different tools.

Another approach would be to provide a (simplified, specialized) Haskell parser with a clearly defined interface making information like that described above available. This would no doubt prove to be very popular with people wanting to develop various documentation tools. But if this interface were to be standardized, e.g. in the form of an algebraic data type in Haskell, then it would not be directly useful for people wishing to develop in some other language. Also, Haskell types are not very extensible, which would create all sorts of compatibility problems if the standard were to evolve.

A third approach would be to define a standard, intermediary documentation format which is easy to generate (once one has the necessary information) and parse.
Then, as long as at least one tool generating this format exists, it would be relatively straightforward to develop all sorts of formatters and other creative applications around this. (Looking back at the history of Haskell documentation tools, this has actually happened at least three times: "FudgetsDoc" and HaskellDoc both used HBC's interface files to get type information, and more recently Jan Skibinski's source code browser uses GHC's interface files in a similar way. But of course, in all cases, these tools became tied to one (or two) particular compiler(s), they became likely to break if the format of the interface files changed, and they were limited by the information that happened to be available. Hence the need for a standard.)

Personally, I think a compiler would be in a good position to generate intermediary documentation files since it has access to all (or at least most) of the information that is needed. (This is also not without precedent: Sun's Workshop C compiler can emit information for a browsing tool, the CenterLine C compiler used to do something similar, and asking compilers for module dependence information is basically a simple instance of the same idea.) On the other hand, there are some problems, such as the need to respect user-supplied type signatures (as opposed to always using the inferred ones), and the fact that the types of non-exported entities might be thrown away at some inconveniently early point. So not everyone likes this.

However, how intermediary documentation is generated is a secondary issue. Having a well-specified format means that anyone who would like to write a tool supplying such information has something to aim at, and that anyone who is mainly interested in doing something with such information has a good place to start from.

Finally, I believe that developing the source-level documentation conventions and an intermediary documentation format in parallel will be mutually beneficial.
Defining the intermediary format will force us to think about what documentation *is* (without the need to consider specific renderings) and thus what information needs to be provided by the commenting conventions. Conversely, practical requirements such as the source code remaining legible will prevent the intermediary format from becoming too unwieldy.

-----------------------------------------------------------------------

4. I think the intermediate format should be based on XML.

I think this simply because XML is a rapidly emerging standard which was developed with precisely this kind of application (semantic markup) in mind. A large number of tools related to XML are already available, including some Haskell ones.

Best regards,

/Henrik

-- 
Henrik Nilsson
Yale University
Department of Computer Science
nilsson@cs.yale.edu