Accumulating related XML nodes using HXT

Hello.

I have some html from which I want to extract records. Each record is represented within a number of <tr> nodes, and all records' <tr> nodes are contained by the same parent node.

The things I've tried so far end up giving me the cartesian product of record fields, so for the html fragment included below I'd end up with:

    [ Prod "Television" 17 "/prod17" "A very nice telly."
    , Prod "Television" 17 "/prod17" "Mind your fillings."
    , Prod "Cyclotron" 24 "/prod24" "A very nice telly."
    , Prod "Cyclotron" 24 "/prod24" "Mind your fillings."
    ]

instead of:

    [ Prod "Television" 17 "/prod17" "A very nice telly."
    , Prod "Cyclotron" 24 "/prod24" "Mind your fillings."
    ]

How should I go about accumulating related <tr> nodes into individual records?

Thanks
Daniel

HTML fragment follows:

    ...
    <tr>
      <tr>
        <td><strong>Product:</strong></td>
        <td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
      </tr>
      <tr>
        <td><strong>Description:</strong></td>
        <td>A very nice telly.</td>
      </tr>
      <tr>
        <td><hr color="#00000"></td>
      </tr>
      <tr>
        <td><strong>Product:</strong></td>
        <td><strong><a href="/prod24">Cyclotron</a></strong> (code: 24)</td>
      </tr>
      <tr>
        <td><strong>Description:</strong></td>
        <td>Mind your fillings.</td>
      </tr>
      <tr>
        <td><hr color="#00000"></td>
      </tr>
    </tr>
    ...
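The code that produced the cartesian product is not shown in the thread, so the following is only a guess at the mechanism. A common cause is extracting each field as its own independent list of results and then combining those lists, which pairs every name with every description. The sketch below shows the effect with plain Haskell lists; the names `names`, `descriptions`, `allPairs` and `aligned` are made up for illustration.

```haskell
-- Two fields extracted independently of each other (hypothetical data).
names :: [String]
names = ["Television", "Cyclotron"]

descriptions :: [String]
descriptions = ["A very nice telly.", "Mind your fillings."]

-- Independent combination: every name meets every description.
allPairs :: [(String, String)]
allPairs = [(n, d) | n <- names, d <- descriptions]

-- Positional combination: the i-th name with the i-th description.
aligned :: [(String, String)]
aligned = zip names descriptions
```

Here `allPairs` has four elements and `aligned` has two, which matches the unwanted and wanted results in the question.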

Daniel McAllansmith writes:
> Hello.
> I have some html from which I want to extract records. Each record is represented within a number of <tr> nodes, and all records <tr> nodes are contained by the same parent node.
This is very poorly written HTML. The original structure of the data is destroyed - the parse tree no longer reflects the data structure. (If a record is to be displayed in several rows, there are proper ways.) It is syntactically incorrect: nested <tr>, and color in <hr>. (Just ask http://validator.w3.org/ .)

I trust that you are parsing this because you realize it is all wrong and you want to programmatically convert it to proper markup.

Since the file is unstructured, I choose not to sweat over restoring the structure in an HXT arrow. The HXT arrow will return a flat list, just as the file is a flat ensemble. The list looks like:

    ["/prod17", "Television", " (code: 17)", "A very nice telly.",
     "/prod24", "Cyclotron", " (code: 24)", "Mind your fillings."]

I then use a pure function to decompose this list four items at a time to emit the desired records. This is trivial outside HXT arrows. I use tuples, and every field is a string; you can easily change the code to produce Prod's, turn " (code: 17)" into the number 17, etc.

Here is a complete, validated HTML 4 file containing the table, just so that my program below actually has valid input.
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    <title>Products</title>
    </head>
    <body>
    <table>
      <tr>
        <td><strong>Product:</strong></td>
        <td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
      </tr>
      <tr>
        <td><strong>Description:</strong></td>
        <td>A very nice telly.</td>
      </tr>
      <tr>
        <td><hr></td>
      </tr>
      <tr>
        <td><strong>Product:</strong></td>
        <td><strong><a href="/prod24">Cyclotron</a></strong> (code: 24)</td>
      </tr>
      <tr>
        <td><strong>Description:</strong></td>
        <td>Mind your fillings.</td>
      </tr>
      <tr>
        <td><hr></td>
      </tr>
    </table>
    </body>
    </html>

Here is my program:

    import Text.XML.HXT.Arrow

    main = do { unstructured <- runX (p "table.html")
              ; let structured = s unstructured
              ; print structured
              }

    -- Walk table > tr > td and emit the interesting strings as one flat list.
    p filename =
        readDocument [(a_parse_html,"1")] filename >>>
        deep (isElem >>> hasName "table") >>>
        getChildren >>> isElem >>> hasName "tr" >>>
        getChildren >>> isElem >>> hasName "td" >>>
        getChildren >>>
        p1 <+> p2

    -- <strong><a href="...">name</a></strong>: emit the href, then the link text.
    p1 = isElem >>> hasName "strong" >>>
         getChildren >>> isElem >>> hasName "a" >>>
         getAttrValue "href" <+> (getChildren >>> getText)

    -- Text sitting directly inside a <td>: the code and the description.
    p2 = getText

    -- Regroup the flat list four items at a time.
    s (a:b:c:d: rest) = (a,b,c,d) : s rest
    s _ = []
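The conversion to Prod values and the parsing of " (code: 17)" are left to the reader above. One possible way to do it is sketched below, using only base. The thread never gives the definition of Prod, so the type here is an assumption inferred from the values in the question (name, code, URL, description). `parseCode` and `toProds` are hypothetical helper names, and skipping a group whose code cannot be parsed is a design choice of this sketch, not something the thread specifies.

```haskell
import Data.Char (isDigit)

-- Assumed record shape: name, code, URL, description (inferred from the question).
data Prod = Prod String Int String String
  deriving (Eq, Show)

-- Pull out the first run of digits, e.g. " (code: 17)" gives Just 17.
parseCode :: String -> Maybe Int
parseCode str =
  case span isDigit (dropWhile (not . isDigit) str) of
    ("", _)     -> Nothing
    (digits, _) -> Just (read digits)

-- Consume the flat list four fields at a time: href, name, code text, description.
toProds :: [String] -> [Prod]
toProds (href : name : code : desc : rest) =
  case parseCode code of
    Just n  -> Prod name n href desc : toProds rest
    Nothing -> toProds rest  -- skip a group whose code has no digits
toProds _ = []
```

Replacing `s` with `toProds` in the program would print a list of Prod values instead of tuples. Any trailing group of fewer than four strings is silently dropped, as it is in `s`.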

On Wednesday 01 November 2006 10:57, Albert Lai wrote:
> Daniel McAllansmith writes:
>> Hello.
>> I have some html from which I want to extract records. Each record is represented within a number of <tr> nodes, and all records <tr> nodes are contained by the same parent node.
>
> This is very poorly written HTML. The original structure of the data is destroyed - the parse tree no longer reflects the data structure. (If a record is to be displayed in several rows, there are proper ways.) It is syntactically incorrect: nested <tr>, and color in <hr>. (Just ask http://validator.w3.org/ .)

Indeed. The original is even worse, with overlapping nodes and other such treasures which make navigation in HXT tricky at times.

> I trust that you are parsing this because you realize it is all wrong and you want to programmatically convert it to proper markup.

Yep! I sure wouldn't be doing this if I had control of the original HTML.

> Since the file is unstructured, I choose not to sweat over restoring the structure in an HXT arrow. The HXT arrow will return a flat list, just as the file is a flat ensemble.

I was about to write a follow-up just as your mail came in... I've ended up with the same solution as you've kindly suggested.

Another option I came across is Control.Arrow.ArrowTree.changeChildren, which could be used to restore a more normalised structure ready for more processing.

Thanks
Daniel
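The changeChildren route is only named in the thread, not worked out. As far as I recall, changeChildren takes a function from a list of child nodes to a list of child nodes, so the core of that approach would be a regrouping function over siblings. The sketch below shows only that regrouping step, over a hypothetical `Row` type standing in for HXT's tree nodes so that it runs with base alone. `Row`, `isSeparator` and `splitRecords` are illustrative names, and treating the <hr> row as the record separator is an assumption based on the fragment in the question.

```haskell
-- Stand-in for a <tr> node: either a labelled field row or an <hr> separator row.
data Row = Field String String | Separator
  deriving (Eq, Show)

isSeparator :: Row -> Bool
isSeparator Separator = True
isSeparator _         = False

-- Split sibling rows into records, ending each record at a separator.
-- Separators are dropped and empty groups are discarded.
splitRecords :: (a -> Bool) -> [a] -> [[a]]
splitRecords isSep rows =
  case break isSep rows of
    ([], [])           -> []
    (record, [])       -> [record]
    ([], _ : rest)     -> splitRecords isSep rest
    (record, _ : rest) -> record : splitRecords isSep rest
```

To use this with HXT, each group would still have to be wrapped back into a single node (for example one element per record) before being handed to changeChildren; that wiring is not shown here.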

participants (2)
- Albert Lai
- Daniel McAllansmith