Since the document claims it is HTML, you should be parsing it with an HTML parser. Try hxt-tagsoup -- specifically, the "parseHtmlTagSoup" arrow.