
After an extended period of procrastination and nipping at the edges of the problem, I feel a need to tackle head on the requirement for a "usable" XML handling library in Haskell.

As far back as 1999, Malcolm Wallace and Colin Runciman observed that "Haskell is a very suitable language for XML processing" [1], yet there still does not seem to be a generically useful XML handling library, suitable (say) for inclusion as part of a standard Haskell compiler library.

Why do I say this? I have looked at three separate XML libraries, and each has problems which I perceive make it unsuitable for the purposes I have in mind. It may be that part of my problem here is a mistaken view of Haskell programming style, possible exposure of which is one of my reasons for posting this.

My immediate goal is to create an RDF/XML parser, the output from which is a data structure representing an RDF graph. This involves parsing the XML to an XML-infoset-like form, then traversing this to extract information for the RDF graph. I want to create a function like this:

    parseRDFXML :: String -> RDFGraph

My XML processing requirements:
(1) basic XML parsing
(2) predefined and character entity handling (&lt; &amp; &#n; etc.)
(3) general entity handling (DTD entity definitions and substitutions, per internal DTD subset)
(4) easy access to values extracted from XML data (i.e. for XML-to-non-XML processing)
(5) XML namespace handling
(6) library usable outside the IO monad (i.e. by functions that return non-IO values)

Non-requirements, but maybe nice to have:
(7) parameter entities
(8) external entities
(9) XML/DTD validation

(1)-(4) correspond roughly to the level of support required of XML parsers for handling "standalone" documents. Almost all modern usage of XML that I'm aware of depends to a lesser or greater degree on (5). (8) and (9) are, I think, in conflict with having library functions that can be used outside the IO monad, since they require that the parser be able to access external data.
(7) is really a helper for DTD-based validation, and my own view is that validity checking is better performed using XML schema. Some of the facilities provided by (8) are now being addressed by alternative activities that build upon basic XML (XInclude, binary attachments for SOAP, etc.).

Requirement (6) arises for me because I have adopted a style of programming in Haskell that consists mostly of pure functions, without recourse to monads. I use parser monads locally as required, and I use IO and state monads at the upper levels of my programs to deal with and record the program's interaction with the outside world. It seems to me that this approach leads to functions that are easier to pick up and use. I've found that, when using third party libraries, stand-alone functions present an easier learning curve compared with libraries that are based around a (sometimes complex) monadic state. Maybe I'm missing something here?

...

Turning to the XML libraries, I've looked at three:
(A) Joe English's HXML parser [2]
(B) HaXml [3]
(C) Haskell Xml Toolbox [4]

(A) HXML is very easy to understand and use, but it does very little more than basic XML parsing. No level of DTD handling is provided, as far as I've been able to determine.

(B) HaXml does a little more of what I want, to the extent that it can parse DTDs, and even perform some basic validity checking. I can't find anywhere in the code that seems to address substitution of entities defined in the DTD, and I'm not sure if it can parse a DTD and XML from the same XML file. There are references in the code to external DTD subsets, but I can't see any attempt to implement this. I have found that HaXml's error handling is rather severe, in that there is a wide range of input data errors that cause the library to call 'error' rather than return a diagnostic value.

(C) Haskell Xml Toolbox is the most functional package (being the only package with XML namespace support) and also the most difficult to use.
Unfortunately, it seems that much of the DTD functionality (needed for expanding general entities) is performed in the IO monad, as it is part of the code that performs validation, which, as noted above, needs access to external resources. My biggest problem with this package is that it seems to be very difficult and unwieldy to use as part of another library: much of the code seems to be oriented toward creating complete programs for XML-to-XML transformations of various kinds.

I've a view that XML namespace support should be quite easy to graft onto either of the other packages, given an extension to the data type used to describe nodes and elements.

Over the past couple of months, I've been wavering between pushing ahead with (B) or (C). Both have problems, and either would require significant effort on my part.

If I use (C), it involves the least amount of new code, but I think I would find myself ripping out chunks of code to create functions that I can use outside the IO monad, which would effectively fork the codebase.

If I use (B), I need to address the error handling problem, though I think I know roughly how to do that (I already made a start; details below). A previous problem I had was that the HaXml code needs CPP preprocessing, which was problematic for me, but since then a simple CPP-equivalent in Haskell has been implemented so I think I can work around that problem. I think I'd need to write new code to deal with entity substitution and namespaces, but I think both of those could be implemented as filters that layer on top of the basic package.

So the pendulum swings again, and I now think that HaXml is looking like the most promising base for further development.

...

What do I think an XML library for Haskell should look like?
The components I'd like to see would look something like this:

    XML parser :: String ---> (internal representation)
         |    \
         |     -----> IO function to perform full validation
         |            and external DTD handling [optional]
         v
    XML filter combinators --+--> entity substitution logic
         |                   +--> namespace handling
         |                   +--> XSLT processing [optional, for now **]
         v
    DOM-like read-only interface for access to data at level
    comparable to XML infoset (used to avoid dependency between
    applications that use infoset data and details of the
    internal representation used.)

Does this seem reasonable?

[**] my thought is that an XSLT document could be "compiled" into an XML filter function.

#g
--
[1] http://www.cs.york.ac.uk/fp/HaXml/icfp99.html#furtherwork
[2] http://www.flightlab.com/~joe/hxml/
[3] http://www.cs.york.ac.uk/fp/HaXml/
[4] http://www.fh-wedel.de/~si/HXmlToolbox/

....

Work I've already done on HaXml:

I made a start on a unit test program, and some modifications to the HMW combinator library to allow parse errors to be handled by the calling program. The initial test data has been stolen from the Hxml Toolbox software kit. The 4 test cases all run without errors under Hugs. Feel free to grab anything you think may be useful.

The revisions are here:
http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/

The test program and data files are here:
http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/test/
(The test program has a commented-out feature to generate formatted versions of the input files which can be renamed for use as comparison test data.)

The modified source code includes:

+ http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/src/Text/ParserCo...
  modified to include an option to return a diagnostic message or parser result, via an (Either String a) value. The original interface is (mostly) preserved, and new functions added to support the extended return values (e.g. papplydiag).

+ http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/src/Text/XML/HaXm...
  modified to work with the revised parser structure. A new function, xmlParseDiag, added to return an (Either String a) value. Also added an eof parser and documentOnly functions to perform some of the function of sanitycheck. (I also commented out the #if stuff so I could test under Hugs.)

(Malcolm pointed out to me that the lexer also throws some errors, but I think that could be addressed by returning an error token and leaving the parser to deal with the resulting syntax error.)

------------
Graham Klyne
For email: http://www.ninebynine.org/#Contact
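The idea in the message above, of returning parse failures as an (Either String a) value instead of calling 'error', can be illustrated with a minimal sketch. Every name here (Parser, char, name, eof, emptyElement, parseDiag) is hypothetical and chosen for illustration; this is not the code of papplydiag or xmlParseDiag, and it assumes a compiler where Applicative is a superclass of Monad.

```haskell
import Data.Char (isAlpha)

-- Hypothetical parser type (not the HaXml combinator library):
-- a parser either fails with a diagnostic message or returns a
-- value together with the unconsumed input.
newtype Parser a = Parser { runParser :: String -> Either String (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Left err        -> Left err
    Right (a, rest) -> Right (f a, rest)

instance Applicative Parser where
  pure a = Parser $ \s -> Right (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Left err        -> Left err
    Right (f, rest) -> case pa rest of
      Left err         -> Left err
      Right (a, rest') -> Right (f a, rest')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> case p s of
    Left err        -> Left err
    Right (a, rest) -> runParser (f a) rest

-- Match one specific character, or report what was found instead.
char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  (x:xs) | x == c -> Right (c, xs)
  (x:_)           -> Left ("expected " ++ show c ++ " but found " ++ show x)
  []              -> Left ("expected " ++ show c ++ " but reached end of input")

-- One or more alphabetic characters (a simplification of XML names).
name :: Parser String
name = Parser $ \s -> case span isAlpha s of
  ([], _)   -> Left "expected an element name"
  (n, rest) -> Right (n, rest)

-- Succeed only when all input has been consumed.
eof :: Parser ()
eof = Parser $ \s ->
  if null s then Right () else Left ("unexpected trailing input: " ++ take 10 s)

-- An empty element such as <br/>; returns the element name.
emptyElement :: Parser String
emptyElement = do
  _ <- char '<'
  n <- name
  _ <- char '/'
  _ <- char '>'
  return n

-- Run a parser over a whole input. The caller gets a diagnostic
-- value back and decides what to do with it; nothing calls 'error'.
parseDiag :: Parser a -> String -> Either String a
parseDiag p s = fst <$> runParser (p <* eof) s
```

With these definitions, `parseDiag emptyElement "<br/>"` yields `Right "br"`, while malformed input such as `"<br>"` yields a `Left` carrying a message. Because the result is an ordinary value, a function like parseRDFXML could stay outside the IO monad and still report bad input.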

Something which wasn't mentioned but is quite useful is type-specialised XML parsers.

Some tools manipulate XML generically, giving you back some DOM tree, which is great if you are writing a general purpose XML parser. However most users know exactly what DTD/Schema/Type they are dealing with and would like to get their own data type back from the parser (as well as having the parser validate it). This allows you to use the parser/pretty printer in a similar way to ordinary read/show. Other people have pointed out that this should make XSLT-style transformations really easy (and type safe). (Automatically deriving readXML/showXML would be nice!)

Some Haskell XML libs/toolkits have tools for converting DTD<->Haskell types.

I suggest this would be a very useful feature of a standard XML library.

Duncan
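The read/show analogy above can be sketched as a type class over a generic tree. This is only an illustration under stated assumptions: the Xml type, the XmlCodec class, the Point type and the readInt helper are all hypothetical names, not taken from HaXml or any other library, and the instance is written by hand where a DTD-to-Haskell tool would generate it.

```haskell
-- Hypothetical generic document tree; real libraries define their own.
data Xml = Elem String [Xml] | Text String
  deriving (Eq, Show)

-- Hypothetical class playing the role of readXML/showXML.
class XmlCodec a where
  showXml :: a -> Xml
  readXml :: Xml -> Either String a

-- Application type matching an assumed DTD fragment:
--   <!ELEMENT point (x, y)>
--   <!ELEMENT x (#PCDATA)>
--   <!ELEMENT y (#PCDATA)>
data Point = Point { px :: Int, py :: Int }
  deriving (Eq, Show)

instance XmlCodec Point where
  showXml (Point x y) =
    Elem "point" [Elem "x" [Text (show x)], Elem "y" [Text (show y)]]
  -- Reading checks the element structure as it goes, so a successful
  -- result is also a (very limited) validation of the input.
  readXml (Elem "point" [Elem "x" [Text xs], Elem "y" [Text ys]]) =
    Point <$> readInt "x" xs <*> readInt "y" ys
  readXml other = Left ("not a valid <point>: " ++ show other)

-- Parse element text as an Int, naming the element in the diagnostic.
readInt :: String -> String -> Either String Int
readInt field s = case reads s of
  [(n, "")] -> Right n
  _         -> Left ("element <" ++ field ++ "> is not an integer: " ++ show s)
```

Here `readXml (showXml (Point 3 4))` gives back `Right (Point 3 4)`, and a tree of the wrong shape gives a `Left` with a message. Application code works with Point directly and never inspects the generic tree.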

At 18:21 13/05/04 +0100, Duncan Coutts wrote:
Something which wasn't mentioned but is quite useful is type-specialised xml parsers. ...
HaXml has something like this. I'd suggest that something like this wouldn't necessarily be *part* of an XML library, but an additional XML library that uses a generic XML framework. But maybe that's what you meant?

As it happens, it's not part of the requirements I'm looking at, because my aim is to extract information from XML to another internal format, but I acknowledge the possible value of this.

Maybe it's worth reviewing what Wallace & Runciman say about this (section 3.4 of their paper [1]). I don't know if any of this work has been updated to take account of XML schemas.

#g
--
[1] http://www.cs.york.ac.uk/fp/HaXml/icfp99.html
------------ Graham Klyne For email: http://www.ninebynine.org/#Contact

Quoting Duncan Coutts:
Something which wasn't mentioned but is quite useful is type-specialised XML parsers. Some tools manipulate XML generically, giving you back some DOM tree, which is great if you are writing a general purpose XML parser. However most users know exactly what DTD/Schema/Type they are dealing with and would like to get their own data type back from the parser (as well as having the parser validate it). This allows you to use the parser/pretty printer in a similar way to ordinary read/show. Other people have pointed out that this should make XSLT-style transformations really easy (and type safe). (Automatically deriving readXML/showXML would be nice!)
Some Haskell XML libs/toolkits have tools for converting DTD<->Haskell types.
We ran some experiments embedding G-codes (ISO 6983) in Haskell. We defined a DTD for the G-code format and used HaXml as its authors describe. It worked in both experiments, as they said it would. HaXml is useful for converting XML <-> Haskell!
I suggest this would be a very useful feature of a standard xml library.
Duncan
We suggest more comprehensive experiments before making HaXml a standard Haskell library.

Regards,
Gustavo.
participants (3)
- Duncan Coutts
- garroyo@dsic.upv.es
- Graham Klyne