
Daniel McAllansmith wrote:
Hello.
I have some HTML from which I want to extract records. Each record is represented within a number of <tr> nodes, and all records' <tr> nodes are contained by the same parent node.
This is very poorly written HTML. The original structure of the data is destroyed - the parse tree no longer reflects the data structure. (If a record is to be displayed in several rows, there are proper ways.) It is syntactically incorrect: nested <tr>, and color in <hr>. (Just ask http://validator.w3.org/ .) I trust that you are parsing this because you realize it is all wrong and you want to programmatically convert it to proper markup.

Since the file is unstructured, I choose not to sweat over restoring the structure in an HXT arrow. The HXT arrow will return a flat list, just as the file is a flat ensemble. The list looks like:

["/prod17", "Television", " (code: 17)", "A very nice telly.",
 "/prod24", "Cyclotron", " (code: 24)", "Mind your fillings."]

I then use a pure function to decompose this list four items at a time to emit the desired records. This is trivial outside HXT arrows. I use tuples, and every field is a string; you can easily change the code to produce Prod's, turn " (code: 17)" into the number 17, etc.

Here is a complete, validated HTML 4 file containing the table, just so that my program below actually has valid input.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Products</title>
</head>
<body>
<table>
  <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
  </tr>
  <tr>
    <td><strong>Description:</strong></td>
    <td>A very nice telly.</td>
  </tr>
  <tr>
    <td><hr></td>
  </tr>
  <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod24">Cyclotron</a></strong> (code: 24)</td>
  </tr>
  <tr>
    <td><strong>Description:</strong></td>
    <td>Mind your fillings.</td>
  </tr>
  <tr>
    <td><hr></td>
  </tr>
</table>
</body>
</html>

Here is my program:

import Text.XML.HXT.Arrow

main = do { unstructured <- runX (p "table.html")
          ; let structured = s unstructured
          ; print structured
          }

p filename =
    readDocument [(a_parse_html,"1")] filename
    >>>
    deep (isElem >>> hasName "table")
    >>>
    getChildren >>> isElem >>> hasName "tr"
    >>>
    getChildren >>> isElem >>> hasName "td"
    >>>
    getChildren
    >>>
    p1 <+> p2

p1 = isElem >>> hasName "strong"
     >>>
     getChildren >>> isElem >>> hasName "a"
     >>>
     getAttrValue "href" <+> (getChildren >>> getText)

p2 = getText

s (a:b:c:d: rest) = (a,b,c,d) : s rest
s _ = []
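
The remark above about producing Prod values and turning " (code: 17)" into the number 17 could be sketched roughly as follows. This is only an illustration under assumptions: the Prod record and its field names (prodUrl, prodName, prodCode, prodDesc) are hypothetical, since the real Prod type is not shown here, and parseCode and toProds are names made up for this sketch. It uses only the base library and works on the flat list of strings, so it is independent of HXT.

```haskell
import Data.Char (isDigit)

-- Hypothetical record type; the field names are assumptions for illustration,
-- not the Prod type from the original question.
data Prod = Prod
  { prodUrl  :: String
  , prodName :: String
  , prodCode :: Int
  , prodDesc :: String
  } deriving (Show, Eq)

-- Extract the first run of digits from a string such as " (code: 17)".
-- Returns Nothing when the string contains no digits at all.
parseCode :: String -> Maybe Int
parseCode str =
  case takeWhile isDigit (dropWhile (not . isDigit) str) of
    ""     -> Nothing
    digits -> Just (read digits)

-- Consume the flat list four items at a time, like 's' in the program above.
-- A group whose code field has no digits is skipped (a choice made for this
-- sketch); leftover items that do not fill a group of four are dropped.
toProds :: [String] -> [Prod]
toProds (url : name : code : desc : rest) =
  case parseCode code of
    Just n  -> Prod url name n desc : toProds rest
    Nothing -> toProds rest
toProds _ = []
```

With these definitions, replacing `s unstructured` by `toProds unstructured` in main should yield two Prod values for the sample table, with codes 17 and 24. Skipping malformed groups silently is just one option; returning an Either with an error message may be preferable when the input is not trusted.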