Another option is the xmlhtml package, which I wrote and which is used by Heist.
An important factor in this decision will be what range of input you need to accept, and what you want as a result. A fully compliant HTML5 parser will parse most input, but the resulting data will be somewhat complex. On the other hand, xmlhtml will accept a smaller subset of HTML5 (but will handle your sample input here just fine) and produce a much simpler output. TagSoup, which someone else recommended, will accept even more, and produce flatter output, but I don't know how it would perform on this input.
On Tue, Dec 24, 2013 at 1:42 PM, Brandon Allbery <allbery.b@gmail.com> wrote:
On Tue, Dec 24, 2013 at 2:20 PM, akira kawata <a.kawashiro@gmail.com> wrote:

Did you mean HaXmL?

Pick an XML parser. CDATA is an XML construct. Well-formed HTML *should* be XML compatible, although it's very rare to find proper well-formed HTML these days....

This is actually not true; for example, not closing your <br> tags is perfectly valid HTML5 but invalid XML, and you can use > literals in script tags. The CDATA-inside-comments hack isn't necessary and hasn't been for years. You should try to parse HTML as HTML.

That being said, if html-conduit works for you, use it; if not, try TagSoup, which doesn't try to structure your data into a DOM.

<html>
<p> hogehoge </p>
<script>if(window.mw){
mw.loader.state({"<script>":"</script>","user":"ready","user.groups":"ready"});
}
</script>
</html>

It's worth noting that the browser will probably interpret the quoted </script> as the end-of-script marker; Chrome did when I copied this into an HTML file and saved it. You need to replace it with "</scr" + "ipt>" or something similar. I'm a little surprised html-conduit doesn't interpret </script> as end-of-script.
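
To make the end-of-script point concrete, here is a minimal sketch in plain Haskell (base only). It is not how Chrome or html-conduit is implemented; `scriptBody` is a hypothetical helper that simply cuts the script text at the first case-insensitive "</script", which is a simplification of what an HTML tokenizer does. The two test strings are shortened versions of the line from the sample above.

```haskell
import Data.Char (toLower)
import Data.List (isPrefixOf)

-- Hypothetical helper (an assumption for illustration, not a library function):
-- return the script text up to, but not including, the first "</script",
-- compared case-insensitively. A real tokenizer also checks the character
-- that follows and handles comment-like escapes; this sketch ignores both.
scriptBody :: String -> String
scriptBody [] = []
scriptBody s@(c:cs)
  | "</script" `isPrefixOf` map toLower s = []
  | otherwise                             = c : scriptBody cs

-- The line from the sample, with the closing tag quoted inside a JS string.
original :: String
original = "mw.loader.state({\"<script>\":\"</script>\",\"user\":\"ready\"});"

-- The same line with the closing tag split across two JS string literals.
patched :: String
patched = "mw.loader.state({\"<script>\":\"</scr\" + \"ipt>\",\"user\":\"ready\"});"

main :: IO ()
main = do
  putStrLn (scriptBody original)  -- stops in the middle of the string literal
  putStrLn (scriptBody patched)   -- the whole statement comes through
```

With the original line the script is cut off inside the string literal, so the JavaScript that reaches the interpreter is incomplete; with the split literal nothing in the text matches "</script" and the statement survives intact.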