HaXml and HXml toolbox; namespace support

I'm currently looking at the innards of HXml Toolbox and HaXml, with a view to adopting an XML parser with XML namespace support. Based on that requirement alone, HXml Toolbox would be the obvious choice, since it already has namespace support, but I have some concerns. These may simply be my own ignorance, so I'm airing my views here so that any misconceptions can be corrected.

I present my thoughts in terms of pros and cons for each.

HXml Toolbox
------------

+ XML Namespace support
+ DTD Entity handling
+ Good degree of conformance to W3C test suite
- difficult to find way around documentation; no obvious high-level description, other than Martin Schmidt's thesis, which is out-of-date with respect to the current software.
- can't find simple String -> XML tree parsing function (dealing with Internal DTD Entity components)
- errors seem to be reported to stderr rather than handed back to the calling program
- complex and non-portable distribution: I'm concerned that any attempt to distribute my applications based on this library may prove difficult, short of copying (and effectively branching) the complete source code.
- not developed with Hugs/Windows as an intended target
? efficiency: some problems parsing large XML files with Hugs 98 are noted.
? still actively supported ?

HaXml
-----

+ Already part of the common hierarchical library
+ XML handling is cleanly separated from other functions
+ separate, hand-coded lexer which I assume will give better performance
+ appears to be actively supported
- no namespace support
? DTD Entity handling ?
- errors returned to caller. As far as I can tell, errors are raised using the 'error' function... [which I see results in program termination when evaluated]. Ouch! (Why not 'fail' instead of 'error'?)
- source code needs CPP preprocessing
* no external DTD support [this is not a problem for me, and I'd certainly prefer it to be optional, or at least separated from the XML parsing, to avoid dependency on an HTTP library].
...

A weakness of both packages seems to be the handling of syntax errors in the input.

HaXml uses HuttonMeijerWallace combinators - could these be extended in the style of Parsec to return an error description, thus making it possible to provide an interface that allows the calling program to handle any errors? E.g.
[[
newtype Parser s t a = P (s -> [t] -> [(a,s,[t])])
]]
becomes, say:
[[
newtype Parser s t a = P (s -> [t] -> Either String [(a,s,[t])])
]]
and define fail accordingly. Or, even, just use Parsec?

HXml Toolbox makes mention of reporting errors to stderr, I think [lost reference]. It appears that I can isolate the XML parser, which uses Parsec, but I'm not sure if I can isolate the DTD processing logic that deals with entity substitutions.... This looks problematic: it seems that entity substitution is done in an XmlStateFilter Monad. I'm finding it really hard to tease apart the various strands of processing here, which is indicative of my concerns about using this package.

...

So, any pointers that help me decide which way to jump would be appreciated...

#g

------------
Graham Klyne
For email: http://www.ninebynine.org/#Contact

Graham Klyne
HXml Toolbox
------------
- difficult to find way around documentation; no obvious high-level description, other than Martin Schmidt's thesis which is out-of-date with respect to the current software.
I fear that HaXml also suffers from inadequate documentation.
- not developed with Hugs/Windows as an intended target
HaXml does have the advantage of being tested with all three compilers: ghc, nhc98, and Hugs. As you have already discovered, support for Windows is limited, but together we have now developed a 'hack' to get it going.
? efficiency: some problems parsing large XML files with Hugs 98 are noted.
HaXml /may/ also suffer from space problems when parsing large XML files. Joe English's hxml parser is more lazy, and can be used as a drop-in replacement for HaXml's parser, if this turns out to be a problem.
http://www.flightlab.com/~joe/hxml/
HaXml
-----
+ Already part of the common hierarchical library
... and will be distributed as part of the next release of Hugs.
+ XML handling is cleanly separated from other functions
+ separate, hand-coded lexer which I assume will give better performance
With Haskell, never assume anything re expected performance. I too would hope the hand-coded lexer gives good performance, but if it matters, measure it. There are plenty of profiling tools available.
+ appears to be actively supported
... on a best-effort basis. I haven't got much time to develop HaXml actively myself, but am happy to make bugfixes and merge in new features contributed by others.
- no namespace support
HaXml ignores namespaces, yes. The namespace is simply incorporated into the full name of the element or attribute. It should be relatively easy to design filters for querying/transforming namespaces.
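As a rough illustration of what such a filter would have to do, here is a minimal sketch of resolving a prefixed element name against `xmlns` declarations. It does not use HaXml's own types: the attribute list is a plain association list, and `splitQName`, `nsBindings` and `resolveElem` are hypothetical names invented for this example. Scoping is simplified to a single flat environment.

```haskell
import Data.List (stripPrefix)

-- Split "prefix:local" into (Just "prefix", "local").
-- A name without a colon has no prefix.
splitQName :: String -> (Maybe String, String)
splitQName name = case break (== ':') name of
  (prefix, ':' : local) -> (Just prefix, local)
  _                     -> (Nothing, name)

-- Collect prefix-to-URI bindings from an element's attributes.
-- A bare "xmlns" attribute binds the default namespace, which is
-- recorded here under the empty prefix.
nsBindings :: [(String, String)] -> [(String, String)]
nsBindings attrs =
  [ (prefix, uri)
  | (name, uri) <- attrs
  , Just prefix <- [bindingPrefix name]
  ]
  where
    bindingPrefix "xmlns" = Just ""
    bindingPrefix n       = stripPrefix "xmlns:" n

-- Resolve an element name to (namespace URI, local name).
-- Unprefixed element names take the default namespace, if any.
-- An undeclared prefix is reported as a Left value.
resolveElem :: [(String, String)] -> String
            -> Either String (Maybe String, String)
resolveElem env name = case splitQName name of
  (Nothing, local) -> Right (lookup "" env, local)
  (Just p, local)  -> case lookup p env of
    Just uri -> Right (Just uri, local)
    Nothing  -> Left ("unbound namespace prefix: " ++ p)
```

For nested elements, prepending the inner element's bindings to the inherited environment would give the innermost declaration priority, since `lookup` returns the first match.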
? DTD Entity handling ?
Parameter entity references (PERefs) are expanded in-line during parsing of the DTD. Because they are a macro facility and can occur at almost any point in the DTD structure, it is difficult to write a static datatype structure that includes PERefs fully -- so the Haskell datatypes representing the DTD do not include them at all. Thus, you cannot /generate/ PERefs in a DTD with HaXml, only read them.

General entity references (GERef) are gathered into a lookup table at parse time, and stored inside the top-level document data structure:

    data Document = Document Prolog (SymTab EntityDef) Element

None of the other HaXml functions do anything further with them, but in principle they are there in order to allow them to be used conveniently. (The definitions also remain in their original location within the DTD - they are not macros and do not need to be expanded away.)
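To make the idea of using that lookup table concrete, here is a minimal sketch of expanding general entity references in a piece of character data. It is an assumption-laden stand-in, not HaXml code: a plain association list takes the place of `SymTab EntityDef`, the hypothetical `expandRefs` treats each definition as literal replacement text, and it does not handle character references such as `&#38;` or re-expand references inside replacement text.

```haskell
-- Expand every "&name;" in the input using the given table of
-- general entity definitions (name, replacement text).
-- An unknown name or a missing ';' is reported as a Left value
-- rather than by calling 'error'.
expandRefs :: [(String, String)] -> String -> Either String String
expandRefs table = go
  where
    go [] = Right []
    go ('&' : rest) = case break (== ';') rest of
      (name, ';' : after) -> case lookup name table of
        Just replacement -> (replacement ++) <$> go after
        Nothing          -> Left ("undefined entity: " ++ name)
      _ -> Left "unterminated entity reference"
    go (c : rest) = (c :) <$> go rest
```

For example, with the table `[("co", "Example Ltd")]`, the input `"(c) &co;."` expands to `"(c) Example Ltd."`.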
- errors returned to caller. As far as I can tell, errors are raised using the 'error' function... [which I see results in program termination when evaluated]. Ouch! (Why not 'fail' instead of 'error'?)
Good point. Should be pretty easy to fix.
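A small sketch of the difference, for illustration only. The function `charRef` is hypothetical and not part of HaXml. Written against `MonadFail` instead of calling `error`, it lets the caller pick the monad and so decide how a failure is handled. It assumes a compiler recent enough to export `MonadFail` from the Prelude.

```haskell
import Data.Char (isDigit)

-- Decode the digits of a decimal character reference, e.g. "38"
-- from "&#38;". Failure goes through 'fail', so the caller's choice
-- of monad determines what happens: Nothing in Maybe, an exception in IO.
charRef :: MonadFail m => String -> m Char
charRef digits
  | null digits || not (all isDigit digits) =
      fail ("bad character reference: " ++ show digits)
  | n > 0x10FFFF =
      fail ("character reference out of range: " ++ digits)
  | otherwise = return (toEnum (fromInteger n))
  where
    -- Only evaluated once the first guard has confirmed all digits.
    n = read digits :: Integer
```

Used at type `Maybe Char`, `charRef "38"` gives `Just '&'` and `charRef "x"` gives `Nothing`, with no program termination. Note that `base` provides no `MonadFail` instance for `Either String`, so keeping the message needs either `IO` or a small wrapper type.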
- source code needs CPP preprocessing
Entirely for cross-compiler compatibility.
* no external DTD support [this is not a problem for me, and I'd certainly prefer it to be optional, or at least separated from the XML parsing, to avoid dependency on an HTTP library].
It is perfectly possible to parse an external DTD separately from the content. The only question is how to find the DTD. Someone once worked on using the local Catalogue to get hold of the external DTD, given its SYSTEM reference, but I don't recall whether it was fed back to me, or if it was, why I didn't merge it - probably configuration issues about discovering the location of the Catalogue.
A weakness of both packages seems to be the handling of syntax errors in the input.
HaXml uses HuttonMeijerWallace combinators - could these be extended in the style of Parsec to return an error description, thus making it possible to provide an interface that allows the calling program to handle any errors?
Yes, certainly. As I recall, the original Hutton/Meijer papers on monadic parser combinators developed the scheme starting with 'Parser a', parameterised simply on the return type, through successively more complex types parameterised on token type, running state, and finally error type, ending up with 'Parser s t e a'. Your suggestion of returning an Either type might be a quick-and-easy compromise.
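For concreteness, a minimal sketch of how the `Either String` variant might look. This is not HaXml's actual combinator library: `runParser`, `item` and `satisfy` are names chosen for this example, and the instances are one possible definition among several. An empty result list still means "no parse" in the usual list-of-successes sense, while `Left` carries a description back to the caller.

```haskell
-- The suggested type: state s, tokens t, result a, with errors as values.
newtype Parser s t a = P (s -> [t] -> Either String [(a, s, [t])])

runParser :: Parser s t a -> s -> [t] -> Either String [(a, s, [t])]
runParser (P p) = p

instance Functor (Parser s t) where
  fmap f (P p) = P $ \s ts ->
    fmap (map (\(a, s', ts') -> (f a, s', ts'))) (p s ts)

instance Applicative (Parser s t) where
  pure a = P $ \s ts -> Right [(a, s, ts)]
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad (Parser s t) where
  P p >>= f = P $ \s ts -> do
    results <- p s ts
    -- In this sketch, an error in any continuation aborts the whole parse.
    rests <- mapM (\(a, s', ts') -> runParser (f a) s' ts') results
    return (concat rests)

-- 'fail' hands the message back instead of terminating the program.
instance MonadFail (Parser s t) where
  fail msg = P $ \_ _ -> Left msg

-- Consume one token; running out of input is reported as an error.
item :: Parser s t t
item = P $ \s ts -> case ts of
  []         -> Left "unexpected end of input"
  (t : rest) -> Right [(t, s, rest)]

-- Accept a token satisfying the predicate, else fail with a description.
satisfy :: Show t => (t -> Bool) -> Parser s t t
satisfy ok = do
  t <- item
  if ok t then return t else fail ("unexpected token " ++ show t)
```

A caller can then pattern-match on the result of `runParser` and report or recover from a `Left` as it sees fit. One design question this sketch glosses over is how a choice combinator should treat `Left` in one branch when another branch succeeds.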
Or, even, just use Parsec?
You are welcome to rewrite HaXml's parser in Parsec if you wish. It might even become more space-efficient.

Regards,
Malcolm
participants (2)
-
Graham Klyne
-
Malcolm Wallace