
On Tue, Dec 24, 2013 at 1:42 PM, Brandon Allbery wrote:
On Tue, Dec 24, 2013 at 2:20 PM, akira kawata wrote:
Did you mean HaXmL?
Pick an XML parser. CDATA is an XML construct. Well-formed HTML *should* be XML compatible, although it's very rare to find proper well-formed HTML these days....
This is actually not true; for example, not closing your <br> tags is perfectly valid HTML5 but invalid XML, and you can use > literals in script tags. The CDATA-inside-comments hack isn't necessary and hasn't been for years. You should try to parse HTML as HTML. That being said, if html-conduit works for you, use it; if not, try TagSoup, which doesn't try to structure your data into a DOM.

<html>
<p> hogehoge </p>
<script>if(window.mw){
mw.loader.state({"<script>":"</script>","user":"ready"," user.groups":"ready"});
}
</script>
</html>
It's worth noting that the browser will probably interpret the quoted </script> as the end-of-script marker; Chrome did when I copied this into an HTML file and saved it. You need to replace it with "" or something similar. I'm a little surprised html-conduit doesn't interpret </script> as end-of-script.
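
If you generate the page yourself, one common way to keep a literal closing tag out of inline script text is to write "</" as "<\/"; inside a JavaScript string literal "\/" means the same as "/", so the script still sees the original string. Here is a minimal sketch of that rewrite in plain Haskell. The name escapeScriptClose is made up for illustration and does not come from html-conduit or TagSoup, and it assumes every "</" in the input sits inside a string literal:

```haskell
import Data.List (isPrefixOf)

-- | Hypothetical helper (not part of any library): rewrite every "</"
-- as "<\/" so an HTML tokenizer never sees a closing tag inside inline
-- script text. Assumes each "</" occurs inside a JavaScript string
-- literal, where "\/" is equivalent to "/".
escapeScriptClose :: String -> String
escapeScriptClose [] = []
escapeScriptClose s@(c:cs)
  | "</" `isPrefixOf` s = "<\\/" ++ escapeScriptClose (drop 2 s)
  | otherwise           = c : escapeScriptClose cs
```

For example, in GHCi, putStrLn (escapeScriptClose "{\"<script>\":\"</script>\"}") prints {"<script>":"<\/script>"}. The opening <script> inside the string is left alone, since only the closing sequence ends the element.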