On Tue, Dec 24, 2013 at 1:42 PM, Brandon Allbery <allbery.b@gmail.com> wrote:
On Tue, Dec 24, 2013 at 2:20 PM, akira kawata <a.kawashiro@gmail.com> wrote:
Did you mean HaXml?

Pick an XML parser. CDATA is an XML construct. Well-formed HTML *should* be XML compatible, although it's very rare to find proper well-formed HTML these days....


This is actually not true. For example, leaving your <br> tags unclosed is perfectly valid HTML5 but invalid XML, and HTML lets you write literal < and & characters inside a script element, which XML does not. The CDATA-inside-comments hack isn't necessary and hasn't been for years. You should parse HTML as HTML.

That being said, if html-conduit works for you, use it; if not, try TagSoup, which doesn't try to structure your data into a DOM.
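
For what the TagSoup route might look like, here is a minimal sketch, assuming the tagsoup package is installed. scriptBodies is a name made up for this example, not something TagSoup provides; only parseTags, isTagOpenName, isTagCloseName and innerText come from the library. It walks the flat tag list and collects the text between each <script> and the next </script>. How TagSoup tokenizes tag-like text inside a script body is not something this sketch makes any claim about.

```haskell
import Text.HTML.TagSoup (Tag, innerText, isTagCloseName, isTagOpenName, parseTags)

-- Hypothetical helper (not part of TagSoup): collect the text found between
-- each <script> open tag and the next </script> close tag.
scriptBodies :: String -> [String]
scriptBodies = go . parseTags
  where
    go :: [Tag String] -> [String]
    go tags =
      -- Skip everything up to the next <script> open tag.
      case dropWhile (not . isTagOpenName "script") tags of
        [] -> []
        (_ : rest) ->
          -- Take tags up to the matching close tag, then keep scanning.
          let (body, after) = break (isTagCloseName "script") rest
           in innerText body : go (drop 1 after)
```

Because the result is just a list of strings, there is no DOM to get wrong; you decide what to do with each script body yourself.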

<html>
<p> hogehoge </p>
<script>if(window.mw){
mw.loader.state({"<script>":"</script>","user":"ready","user.groups":"ready"});
}
</script>
</html>

It's worth noting that a browser will interpret the quoted </script> as the end-of-script marker, because the HTML tokenizer knows nothing about JavaScript string literals; Chrome did when I copied this into an HTML file and saved it. You need to replace it with "</scr" + "ipt>" or something similar. I'm a little surprised html-conduit doesn't interpret </script> as end-of-script.
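
If the page is generated from Haskell, the "something similar" can be automated. The following is only a sketch using base: escapeScriptBody is a hypothetical helper name, and it assumes every "</" in the script source sits inside a JavaScript string literal, where "<\/" means exactly the same thing as "</". It is not safe for arbitrary JavaScript in which "</" appears outside a string.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical helper: rewrite every "</" as "<\/" so that the HTML
-- tokenizer never sees "</script" inside the script body.
-- Assumption: each "</" occurs inside a JavaScript string literal.
escapeScriptBody :: String -> String
escapeScriptBody [] = []
escapeScriptBody s@(c : cs)
  | "</" `isPrefixOf` s = "<\\/" ++ escapeScriptBody (drop 2 s)
  | otherwise = c : escapeScriptBody cs
```

Applied to the mw.loader.state line above, the quoted "<script>" is left alone and the quoted "</script>" becomes "<\/script>", which JavaScript reads as the same string while the browser no longer sees a closing tag.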