Parse HTML that contains JavaScript

Hi, I am Akira. I want to parse an HTML file that contains JavaScript, but I can't work out how to deal with the script tag. Can anyone help me?

Details of the problem: the HTML code I want to parse looks like the following.

<html> <script> //<![CDATA[ <!-- --> //]]> </script> </html>

Because '<' is used as a normal character inside the script region, I cannot use my HTML parser there.

On Tue, Dec 24, 2013 at 2:03 PM, akira kawata wrote:
<html> <script> //<![CDATA[ <!-- --> //]]> </script> </html>

An XML parser might help with CDATA blocks.

--
brandon s allbery kf8nh sine nomine associates
allbery.b@gmail.com ballbery@sinenomine.net
unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

Did you mean HaXml?
I am sorry that I did not explain what I want well.
I don't think that module can parse an HTML file like this one.
I don't care about the JavaScript code itself.
I want to translate the following code
<html>
<p> hogehoge </p>
<script>if(window.mw){
mw.loader.state({"<script>":"</script>","user":"ready","user.groups":"ready"});
}
</script>
</html>
into something like this:
<html>
<p>
hogehoge
<script>
In short, I want the structure of the HTML, excluding the JavaScript.

On Tue, Dec 24, 2013 at 2:20 PM, akira kawata wrote:
Did you mean HaXml?

Pick an XML parser. CDATA is an XML construct. Well-formed HTML *should* be XML compatible, although it's very rare to find proper well-formed HTML these days....

The html-conduit package (http://hackage.haskell.org/package/html-conduit) can parse the above snippet easily: http://lpaste.net/97491
This code reads from stdin and prints out the parsed HTML. Try it out! For documentation on the returned AST, take a look at xml-conduit (http://hackage.haskell.org/package/xml-conduit).
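
As a rough illustration of the html-conduit approach (this is not the code behind the lpaste link, only a sketch), the program below reads HTML from stdin and prints the element structure while skipping whatever sits inside <script>. It assumes the html-conduit, xml-conduit, text, and bytestring packages are installed. The function name `outline` and the two-space indentation are choices made for this example, not part of either library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Text.HTML.DOM as HTML -- from html-conduit
import Text.XML (Document (..), Element (..), Name (..), Node (..)) -- from xml-conduit

-- | Render an element and its descendants, one entry per line,
-- indented by depth. Anything inside a <script> element is dropped.
-- (`outline` is a hypothetical helper written for this sketch.)
outline :: Int -> Element -> [T.Text]
outline depth el = line : concatMap child (elementNodes el)
  where
    tag = nameLocalName (elementName el)
    line = T.replicate depth "  " <> "<" <> tag <> ">"

    -- Nested elements: recurse, unless we are inside <script>.
    child (NodeElement e)
      | tag == "script" = []
      | otherwise = outline (depth + 1) e
    -- Text: keep non-blank text, unless we are inside <script>.
    child (NodeContent t)
      | tag == "script" || T.null (T.strip t) = []
      | otherwise = [T.replicate (depth + 1) "  " <> T.strip t]
    -- Comments and processing instructions are ignored.
    child _ = []

main :: IO ()
main = do
  input <- BL.getContents
  let doc = HTML.parseLBS input -- lenient HTML parse into an xml-conduit Document
  mapM_ TIO.putStrLn (outline 0 (documentRoot doc))
```

Run it with something like `runghc Outline.hs < page.html` (the file names are placeholders). How the script body itself is split into nodes is up to html-conduit; the sketch sidesteps that question by discarding every child of <script>.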
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Tue, Dec 24, 2013 at 1:42 PM, Brandon Allbery wrote:
Pick an XML parser. CDATA is an XML construct. Well-formed HTML *should* be XML compatible, although it's very rare to find proper well-formed HTML these days....

This is actually not true; for example, not closing your <br> tags is perfectly valid HTML5 but invalid XML, and you can use > literals in script tags. The CDATA-inside-comments hack isn't necessary and hasn't been for years. You should try to parse HTML as HTML.

That being said, if html-conduit works for you, use it; if not, try TagSoup, which doesn't try to structure your data into a DOM.

<html>
<p> hogehoge </p>
<script>if(window.mw){
mw.loader.state({"<script>":"</script>","user":"ready","user.groups":"ready"});
}
</script>
</html>

It's worth noting that the browser will probably interpret the quoted </script> as the end-of-script marker; Chrome did when I copied this into an HTML file and saved it. You need to replace it with "" or something similar. I'm a little surprised html-conduit doesn't interpret </script> as end-of-script.
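
The end-of-script behaviour described above can be sketched with nothing but base. In HTML the body of a script element is raw text: the tokenizer does not look for tags inside it and simply stops at the first "</script", quoted or not. The toy tokenizer below follows that rule in a very simplified form. It ignores attributes that contain '>', does not treat comments specially, and does not check the character after "</script". All names in it (`Token`, `tokenize`, `structure`, `breakOnScriptEnd`) are invented for this example.

```haskell
module Main (main) where

import Data.Char (isAlphaNum, toLower)
import Data.List (isPrefixOf)

-- | A deliberately tiny token type.
data Token
  = Open String   -- opening tag, lower-cased name
  | Close String  -- closing tag, lower-cased name
  | Text String   -- character data between tags
  | Script String -- raw, unparsed body of a <script> element
  deriving (Eq, Show)

-- | Skip the rest of a tag, up to and including the next '>'.
dropTag :: String -> String
dropTag = drop 1 . dropWhile (/= '>')

-- | Split the input at the first "</script" (case-insensitive).
-- The closing tag itself stays in the second component.
breakOnScriptEnd :: String -> (String, String)
breakOnScriptEnd [] = ([], [])
breakOnScriptEnd s@(c : cs)
  | "</script" `isPrefixOf` map toLower (take 8 s) = ([], s)
  | otherwise = let (body, rest) = breakOnScriptEnd cs in (c : body, rest)

tokenize :: String -> [Token]
tokenize [] = []
tokenize ('<' : '/' : rest) =
  let (name, after) = span isAlphaNum rest
  in Close (map toLower name) : tokenize (dropTag after)
tokenize ('<' : rest)
  | null rawName = Text "<" : tokenize rest -- a stray '<', kept as text
  | name == "script" = Open name : Script body : tokenize afterBody
  | otherwise = Open name : tokenize afterTag
  where
    (rawName, attrs) = span isAlphaNum rest
    name = map toLower rawName
    afterTag = dropTag attrs
    -- Raw-text mode: no tag recognition until "</script".
    (body, afterBody) = breakOnScriptEnd afterTag
tokenize s =
  let (txt, rest) = break (== '<') s
  in Text txt : tokenize rest

-- | Keep only the document structure: opening tags and non-blank text.
-- Script bodies and closing tags are dropped.
structure :: [Token] -> [String]
structure = concatMap go
  where
    go (Open n) = ["<" ++ n ++ ">"]
    go (Text t) = [w | let w = unwords (words t), not (null w)]
    go _ = []

main :: IO ()
main = interact (unlines . structure . tokenize)
```

Fed a script whose string literal contains an unescaped </script>, this tokenizer ends the script body at that point and the remainder of the JavaScript comes out as ordinary text, which matches what Chrome did in the experiment above. One common workaround, offered here as a suggestion rather than something stated in the thread, is to write the closing tag as "<\/script>" inside the JavaScript string. The string value is unchanged, but the HTML tokenizer no longer sees an end tag.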

Patrick Hurst wrote:
I'm a little surprised html-conduit doesn't interpret </script> as end-of-script.

It does interpret it as end-of-script. As far as I know, that is the correct behaviour.

Another option is the xmlhtml package, which I wrote and which is used by Heist.
An important factor in this decision will be what range of input you need
to accept, and what you want as a result. A fully compliant HTML5 parser
will parse most input, but the resulting data will be somewhat complex. On
the other hand, xmlhtml will accept a smaller subset of HTML5 (but will
handle your sample input here just fine) and produce a much simpler
output. TagSoup, which someone else recommended, will accept even more,
and produce flatter output, but I don't know how it would perform on this
input.

I've used HXT with the tagsoup backend for parsing HTML with embedded JavaScript. It worked fine for me, although I don't think I've ever had to deal with CDATA embedded in comments inside scripts. You can have a look at the source of the 'jespresso' library on Hackage if you are interested.
participants (6)
- akira kawata
- Andras Slemmer
- Andrey Chudnov
- Brandon Allbery
- Chris Smith
- Patrick Hurst