Re: [Haskell-cafe] Programming style and XML processing in Haskell

Just sticking in my two pence worth...

I am not sure what application you intend this for, but I find most XML parsers completely useless. With my application programmer's hat on, I do not want to validate against a DTD; I want to extract as much information as possible from bad XML. What I would like is a correcting parser: one which outputs XML in compliance, but will accept any old rubbish and make a best-guess attempt to fix it up (based on a set of configurable heuristic rules)...

Secondly, I deal with very large documents, the tree form of which won't fit in memory, so I would see an XML parser doing the following...

parser :: String -> [XmlElements]
filter :: [XmlElements] -> [XmlElements]
reader :: [XmlElements] -> ... output data types ...
writer :: ... input data types ... -> [XmlElements]
render :: [XmlElements] -> String

In order to keep track of the tree structure, the tree depth of each element is encoded within the XmlElement type, thus allowing the data to be streamed through the filters/readers etc. This means the parser can output the first element as soon as it encounters the second element (lazy list == stream in Haskell) rather than having to wait until the last element, as would happen with a DOM tree (it is a tree, not a graph, as XML elements can only contain sub-elements)...

As I said, the above is just my opinion, and as it happens I have written a parser that does the above... I guess that is why there are several parsers for XML available (different requirements) and there will probably be many more...

Regards,
Keean.

At 17:45 13/05/04 +0100, MR K P SCHUPKE wrote:
Just sticking in my two pence worth...
I am not sure what application you intend this for, but I find most XML parsers completely useless. With my application programmer's hat on, I do not want to validate against a DTD, I want to extract as much information as possible from bad XML... what I would like is a correcting parser - one which outputs XML in compliance, but will accept any old rubbish and make a best guess attempt to fix it up (based on a set of configurable heuristic rules)...
I would think this is a rather specialized requirement. I certainly don't want a "correcting" parser for my work. But I can see that some applications might...
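To make the "correcting parser" idea concrete, here is a minimal sketch of one possible fix-up rule: close elements that were left open and drop end tags that match nothing. The `Event` type and the `repair` function are hypothetical names invented for this illustration; they are not taken from Keean's parser or from any existing library, and a real tool would presumably need many more rules than this one.

```haskell
-- Hypothetical event type for an already-tokenised document.
data Event
  = Start String   -- opening tag
  | End String     -- closing tag
  | Chars String   -- character data
  deriving (Eq, Show)

-- | One illustrative heuristic: make the tags nest properly.
-- The first argument of 'go' is the stack of open element names,
-- innermost first. Output is produced lazily as input is consumed.
repair :: [Event] -> [Event]
repair = go []
  where
    -- End of input: close whatever is still open.
    go open [] = map End open
    go open (Start n : es) = Start n : go (n : open) es
    go open (Chars t : es) = Chars t : go open es
    go open (End n : es)
      | n `elem` open =
          -- Close the unclosed inner elements first, then the named one.
          let (inner, rest) = break (== n) open
          in map End inner ++ End n : go (drop 1 rest) es
      | otherwise = go open es   -- stray end tag: drop it
```

For example, `repair [Start "a", Start "b", Chars "x", End "a"]` yields `[Start "a", Start "b", Chars "x", End "b", End "a"]`: the missing `</b>` is supplied before `</a>`.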
Secondly I deal with very large documents, the tree form of which won't fit in memory, so I would see an XML parser doing the following...
parser :: String -> [XmlElements]
filter :: [XmlElements] -> [XmlElements]
reader :: [XmlElements] -> ... output data types ...
writer :: ... input data types ... -> [XmlElements]
render :: [XmlElements] -> String
In order to keep track of the tree structure the tree-depth of each element is encoded within the XmlElement type... thus allowing the data to be streamed through the filters/readers etc. This means the parser can output the first element as soon as it encounters the second element (lazy list == stream in Haskell) rather than having to wait until the last element as would happen with a DOM tree (it is a tree not a graph as XML elements can only contain sub-elements)...
This seems reasonable, and I'd expect a reasonable implementation (of a filter) to stream via lazy evaluation where that matches the final usage pattern. The outline I sketched (copied below) was intended to be built upon something like HaXML's filter idea, so that streaming processing would (in principle) be possible. My requirement is not to generate yet more XML, but to extract something quite different from the XML, so I think I'd be looking for something like your 'reader', which could be part of the lowest element in my diagram.
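A minimal sketch of the depth-tagged stream discussed above, under stated assumptions: the constructors, the depth convention (an element's open and close tags share a depth, its children are one deeper), and the functions `pruneBelow` and `textOf` are all invented for illustration and are not the types of Keean's parser. The filter is named `pruneBelow` rather than `filter` to avoid clashing with the Prelude.

```haskell
-- Hypothetical stream element: every item carries its tree depth.
data XmlElement
  = Open  Int String   -- depth, tag name
  | Text  Int String   -- depth, character data
  | Close Int String   -- depth, tag name
  deriving (Eq, Show)

depth :: XmlElement -> Int
depth (Open d _)  = d
depth (Text d _)  = d
depth (Close d _) = d

-- | A "filter": drop everything nested deeper than n.
-- Because the depth travels with each element, no tree is built and
-- the lazy list is consumed one element at a time.
pruneBelow :: Int -> [XmlElement] -> [XmlElement]
pruneBelow n = filter ((<= n) . depth)

-- | A "reader": the character data immediately following an opening
-- tag with the given name, extracted as plain strings.
textOf :: String -> [XmlElement] -> [String]
textOf name es =
  [ t | (Open d n, Text d' t) <- zip es (drop 1 es), n == name, d' == d + 1 ]

sample :: [XmlElement]
sample =
  [ Open 0 "doc"
  , Open 1 "title", Text 2 "Streaming", Close 1 "title"
  , Open 1 "body"
  , Open 2 "p", Text 3 "Hello", Close 2 "p"
  , Close 1 "body"
  , Close 0 "doc"
  ]
```

Here `textOf "p" sample` gives `["Hello"]`, and `take 2 (pruneBelow 0 (cycle sample))` terminates even though the input is infinite, which is the streaming behaviour the lazy list is meant to provide.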
As I said the above is just my opinion, and as it happens I have written a parser that does the above... I guess that is why there are several parsers for XML available (different requirements) and there will probably be many more ...
I agree about the different requirements, but I think it would be good if this didn't mean different XML libraries; I'm fishing for an arrangement that allows the different requirements to be satisfied from common (or overlapping) components. I like your suggested parser/filter/reader/writer/render model, and I'll consider how that fits with the existing libraries (I really don't want to start from scratch here). I guess a 'parser' could be a special case of 'writer', and 'render' a special case of 'reader'.

#g
--

(Reprise of last part of my previous message...)

What do I think an XML library for Haskell should look like? The components I'd like to see would look something like this:

   XML parser :: String ---> (internal representation)
       |     \
       |      -----> IO function to perform full validation
       |             and external DTD handling [optional]
       v
   XML filter combinators --+--> entity substitution logic
       |                    +--> namespace handling
       |                    +--> XSLT processing [optional, for now **]
       v
   DOM-like read-only interface for access to data at level
   comparable to XML infoset
   (used to avoid dependency between applications that use
   infoset data and details of the internal representation used.)

------------
Graham Klyne
For email: http://www.ninebynine.org/#Contact
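The middle and bottom layers of this outline can be sketched roughly as follows. This is only an illustration in the spirit of the filter idea mentioned above (a filter is a function from one node to a list of nodes); the `Node` type and the names `children`, `tag`, `andThen` and `textsUnder` are assumptions made for this sketch and should not be read as HaXml's actual API.

```haskell
-- Hypothetical internal representation produced by the parser layer.
data Node
  = Elem String [Node]   -- element name and child nodes
  | Txt String           -- character data
  deriving (Eq, Show)

-- A filter maps a node to zero or more result nodes.
type Filter = Node -> [Node]

-- | The immediate children of an element (none for text).
children :: Filter
children (Elem _ cs) = cs
children (Txt _)     = []

-- | Keep the node only if it is an element with the given name.
tag :: String -> Filter
tag name n@(Elem m _) | m == name = [n]
tag _ _ = []

-- | Left-to-right composition: feed every result of f into g.
andThen :: Filter -> Filter -> Filter
andThen f g = concatMap g . f

-- | A small read-only accessor in the role of the bottom layer:
-- the text directly inside child elements with the given name,
-- returned as plain strings rather than as more XML.
textsUnder :: String -> Node -> [String]
textsUnder name doc =
  [ t | Txt t <- (children `andThen` tag name `andThen` children) doc ]

sampleDoc :: Node
sampleDoc =
  Elem "doc"
    [ Elem "title" [Txt "Streaming"]
    , Elem "body" [Elem "p" [Txt "Hello"]]
    ]
```

With these definitions, `textsUnder "title" sampleDoc` gives `["Streaming"]`. Applications would depend only on accessors like `textsUnder`, not on the `Node` constructors, which is the decoupling the read-only interface is meant to provide.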

On Thu, 13 May 2004 17:45:25 +0100 (BST), MR K P SCHUPKE wrote:
I am not sure what application you intend this for, but I find most XML parsers completely useless. With my application programmer's hat on, I do not want to validate against a DTD, I want to extract as much information as possible from bad XML... what I would like is a correcting parser - one which outputs XML in compliance, but will accept any old rubbish and make a best guess attempt to fix it up (based on a set of configurable heuristic rules)...
Bear in mind that such a parser would not be in conformance with the XML specification. The XML Working Group applied the lessons of HTML and concluded that lax parsing rules lead to far more trouble than they're worth, and so the XML specification explicitly lists the kinds of well-formedness errors that might occur in an XML document and that are _required_ to be flagged by a conforming XML processor as fatal errors.

(And also note that in XML-speak, "valid" and "well-formed" are not the same thing: an XML parser can be conforming without doing validation, and an XML document can be well-formed without being valid.)

Obviously, you can do whatever you want in your own code, but I don't think you should hold your breath waiting for someone else to come up with that kind of pseudo-XML parser, since by definition it would be a special-purpose tool.

Steve Schafer
Fenestra Technologies Corp
http://www.fenestra.com/
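By way of contrast with a correcting parser, the strict behaviour described here can be sketched as a check that stops at the first nesting error instead of repairing it. The `Event` type and `checkNesting` are hypothetical names for this illustration, and the check covers only tag nesting, which is just one of the well-formedness constraints in the XML specification; it performs no validation against a DTD.

```haskell
-- Hypothetical event type for an already-tokenised document.
data Event
  = Start String   -- opening tag
  | End String     -- closing tag
  | Chars String   -- character data
  deriving (Eq, Show)

-- | Report the first nesting error as a fatal error (Left), or
-- succeed with Right () if every tag is properly matched.
-- The first argument of 'go' is the stack of open element names.
checkNesting :: [Event] -> Either String ()
checkNesting = go []
  where
    go [] [] = Right ()
    go (n : _) [] = Left ("unclosed element: " ++ n)
    go open (Start n : es) = go (n : open) es
    go open (Chars _ : es) = go open es
    go (o : open) (End n : es)
      | o == n    = go open es
      | otherwise = Left ("expected </" ++ o ++ "> but found </" ++ n ++ ">")
    go [] (End n : _) = Left ("unexpected </" ++ n ++ ">")
```

A document such as `[Start "a", Start "b", End "a"]` is rejected with `Left "expected </b> but found </a>"` rather than being patched up, which is the distinction between a conforming processor and the special-purpose tool discussed above.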