updates to HaXml (1.13.1 and 1.16)

For those who use the current stable version of HaXml, I'd like to announce a new patch-level release, 1.13.1, which contains the following bugfixes:

  * permit percent character in attribute values
  * parse unquoted attribute values starting '+' or '#' in HTML
  * keep the original DTD in the output of 'processXmlWith'

See http://www.haskell.org/HaXml/ for downloads.

For those living on the development edge, I'd like to report that the current darcs version

    darcs get http://www.cs.york.ac.uk/fp/darcs/HaXml

contains a new set of parser combinators (with the same API as before) that is lazier, whilst still allowing backtracking. By lazy, I mean it can start to return partial values as soon as it has consumed e.g. the start tag of an element, without waiting to check that the close tag matches. This has two good effects:

  * your program will run faster
  * it will consume less memory

and two bad effects:

  * if there are errors in the document, they will throw an exception in the middle of your processing
  * the error message in the exception may be rather less accurate about the cause and location than previously.

The older XML parser has also been retained, since the lazy version is still experimental. To use the new one,

    import Text.XML.HaXml.ParseLazy

There are also lazy versions of the usual demo programs

    CanonicaliseLazy
    XtractLazy

As an example of the improved speed, a query to extract all the <key> tags from a 3.7Mb XML document:

    Xtract "//key" file.xml

did not give any results after more than ten minutes on my machine, but

    XtractLazy "//key" file.xml

started producing results immediately, and completed the task in 25 seconds (returning 52584 tags).

Separate website and downloads at http://www.cs.york.ac.uk/fp/HaXml-devel

Regards,
    Malcolm
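The trade-off described above can be illustrated with a toy sketch. This is not HaXml's parser and uses none of its API: `startTags`, `skipPast` and `brokenInput` are hypothetical names invented for the illustration, and the "document" is a string whose tail is deliberately an error. It only shows the general behaviour of a lazy consumer: results appear before the rest of the input has been examined, and a problem further along surfaces as an exception in the middle of processing.

```haskell
module Main where

import Control.Exception (ErrorCall, evaluate, try)

-- Toy scanner (NOT HaXml): lazily yield the name of every start tag.
-- It does not check that close tags match, and it would mistake
-- comments or processing instructions for start tags.
startTags :: String -> [String]
startTags [] = []
startTags ('<':'/':rest) = startTags (skipPast '>' rest)  -- close tag: skip it
startTags ('<':rest) =
  let (name, after) = break (`elem` " />") rest
  in  name : startTags (skipPast '>' after)
startTags (_:rest) = startTags rest                       -- text content: skip it

-- Drop everything up to and including the first occurrence of c.
skipPast :: Char -> String -> String
skipPast c = drop 1 . dropWhile (/= c)

-- Hypothetical input: well-formed at the start, an error further along.
brokenInput :: String
brokenInput = "<doc><key>a</key><key>b</key>" ++ error "malformed input reached"

main :: IO ()
main = do
  -- The first three names are available without touching the bad tail.
  print (take 3 (startTags brokenInput))
  -- Forcing the whole result list runs into the error mid-stream.
  r <- try (evaluate (length (startTags brokenInput)))
  case r of
    Left e  -> putStrLn ("failed mid-stream: " ++ show (e :: ErrorCall))
    Right n -> print n
```

A strict scanner would have to reach the end of the input before returning anything, so in this sketch it would report the error up front and produce no results at all.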

On Mon, 10 Jul 2006, Malcolm Wallace wrote:
> For those living on the development edge, I'd like to report that the current darcs version
>
>     darcs get http://www.cs.york.ac.uk/fp/darcs/HaXml
>
> contains a new set of parser combinators (with the same API as before) that is lazier, whilst still allowing backtracking. By lazy, I mean it can start to return partial values as soon as it has consumed e.g. the start tag of an element, without waiting to check that the close tag matches. This has two good effects:
>
>   * your program will run faster
>   * it will consume less memory

I'm currently trying the latest version from the Darcs repository. The Canonicalise* examples output a lot of spaces within tags.

Henning Thielemann wrote:
> > darcs get http://www.cs.york.ac.uk/fp/darcs/HaXml
>
> I'm currently trying the latest version from the Darcs repository. The Canonicalise* examples output a lot of spaces within tags.

Yes, that is expected behaviour, due to the pretty printer. In order to avoid adding extra whitespace around text (which could be significant), yet still preserve the hierarchical structure of the tree (by indentation), all "structuring" whitespace is placed inside the tags themselves.

I agree it is not ideal behaviour. I would like to offer the possibility of more traditional indentation as well (as an option). No-one has yet coded it however.

Regards,
    Malcolm

On Mon, 10 Jul 2006, Malcolm Wallace wrote:
> Henning Thielemann wrote:
> > > darcs get http://www.cs.york.ac.uk/fp/darcs/HaXml
> >
> > I'm currently trying the latest version from the Darcs repository. The Canonicalise* examples output a lot of spaces within tags.
>
> Yes, that is expected behaviour, due to the pretty printer. In order to avoid adding extra whitespace around text (which could be significant), yet still preserve the hierarchical structure of the tree (by indentation), all "structuring" whitespace is placed inside the tags themselves.

I see, but many tags contain more than 1000 space characters, which multiplies the document size by about 100. I was able to work around the problem by using renderStyle with LeftMode.

> I agree it is not ideal behaviour. I would like to offer the possibility of more traditional indentation as well (as an option). No-one has yet coded it however.

I assume that it is difficult to disable nesting locally with the pretty printer, isn't it? We would need that for content enclosed in PRE tags.
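
The renderStyle/LeftMode workaround mentioned above can be sketched with the standard pretty-printing library alone. This is a minimal illustration, assuming the `pretty` package (Text.PrettyPrint.HughesPJ) is available; it does not call HaXml, and `sample`, `indented` and `flat` are hypothetical names. With HaXml, the Doc would presumably come from its pretty-printing module rather than being built by hand as it is here.

```haskell
module Main where

import Text.PrettyPrint.HughesPJ
  (Doc, Mode (..), Style (..), nest, render, renderStyle, style, text, vcat)

-- A hand-built nested document standing in for pretty-printer output.
sample :: Doc
sample =
  vcat
    [ text "<doc>"
    , nest 2 (vcat [text "<item>", nest 2 (text "payload"), text "</item>"])
    , text "</doc>"
    ]

-- Default rendering: nesting becomes leading indentation.
indented :: String
indented = render sample

-- LeftMode rendering: same lines, but no indentation is emitted.
flat :: String
flat = renderStyle (style { mode = LeftMode }) sample

main :: IO ()
main = do
  putStrLn indented
  putStrLn "----"
  putStrLn flat
```

Note that LeftMode applies to the whole document; it does not by itself answer the question of switching nesting off only locally, e.g. for content inside PRE tags.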

Hello Malcolm, Monday, July 10, 2006, 5:15:00 PM, you wrote:
> As an example of the improved speed, a query to extract all the <key> tags from a 3.7Mb XML document:
>
>     Xtract "//key" file.xml
>
> did not give any results after more than ten minutes on my machine, but
>
>     XtractLazy "//key" file.xml
>
> started producing results immediately, and completed the task in 25 seconds (returning 52584 tags).

Cool thing! It can be used for Haskell advertisement :)

--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com