
Apologies if this is a duplicate; the original appears to have gone astray.

On Wednesday 01 November 2006 10:57, Albert Lai wrote:
Daniel McAllansmith writes:
Hello.
I have some HTML from which I want to extract records. Each record is represented by a number of <tr> nodes, and all records' <tr> nodes are contained by the same parent node.
This is very poorly written HTML. The original structure of the data is destroyed - the parse tree no longer reflects the data structure. (If a record is to be displayed in several rows, there are proper ways.) It is syntactically incorrect: nested <tr>, and color in <hr>. (Just ask http://validator.w3.org/ .)
Indeed. The original is even worse, with overlapping nodes and other such treasures, which make navigation in HXT tricky at times.
I trust that you are parsing this because you realize it is all wrong and you want to programmatically convert it to proper markup.
Yep! I sure wouldn't be doing this if I had control of the original HTML.
Since the file is unstructured, I choose not to sweat over restoring the structure in an HXT arrow. The HXT arrow will return a flat list, just as the file is a flat ensemble.
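For illustration only, here is a minimal sketch of how such a flat list could be regrouped into records after extraction. It uses plain lists and no HXT at all; `groupRecords`, the `Row` type and its `startsRecord` flag are hypothetical names, and the assumption that some row reliably marks the start of each record is mine, not something stated about the actual file.

```haskell
module Main where

-- Hypothetical stand-in for one extracted <tr>: a flag saying whether
-- this row opens a new record, plus the text of its cells.
data Row = Row
  { startsRecord :: Bool
  , cells        :: [String]
  } deriving (Show, Eq)

-- Split a flat list into groups, starting a new group at every element
-- that satisfies the predicate. The first element always opens a group,
-- even if it does not satisfy the predicate.
groupRecords :: (a -> Bool) -> [a] -> [[a]]
groupRecords _ []       = []
groupRecords p (x : xs) = (x : body) : groupRecords p rest
  where
    (body, rest) = break p xs

-- Made-up sample data: two records, spread over three and two rows.
sample :: [Row]
sample =
  [ Row True  ["id", "1"]
  , Row False ["name", "foo"]
  , Row False ["size", "10"]
  , Row True  ["id", "2"]
  , Row False ["name", "bar"]
  ]

main :: IO ()
main = mapM_ print (groupRecords startsRecord sample)
```

Keeping the grouping as a pure list function means it can be tested on its own and applied to whatever list the arrow returns.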
I was about to write a follow-up just as your mail came in... I've ended up with the same solution as the one you kindly suggested.

Another option I came across is Control.Arrow.ArrowTree.changeChildren, which could be used to restore a more normalised structure ready for further processing.

Thanks
Daniel