Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

13 Nov 2010


      ...
> I've been working on a project that requires me to do screen scraping.
If you are screen scraping HTML I think tagsoup is a very good choice.
The use of tagsoup means that you have a real HTML 5 compliant parser
underneath, and then you can use whatever technique you wish to split
up the page text - and regular expressions/parsec might be a
reasonable choice. I've written lots of screen scraping stuff with
tagsoup, and it's usually very easy - the manual even walks you
through a couple of examples:
http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm
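
A rough sketch of that workflow, not taken from the tagsoup manual: parse the
page with tagsoup first, then split the extracted text with ordinary string
functions. The sample page and the helper names (samplePage, linkTargets,
textChunks, keyValue) are made up for illustration; only parseTags,
fromAttrib, isTagOpenName and the Tag constructors come from tagsoup.

```haskell
import Data.Maybe (mapMaybe)
import Text.HTML.TagSoup (Tag(..), fromAttrib, isTagOpenName, parseTags)

-- Hypothetical sample page; a real scraper would download this instead.
samplePage :: String
samplePage =
  "<html><body><h1>Prices</h1><p>Widget: 42 USD</p>\
  \<a href=\"/next\">more</a></body></html>"

-- Step 1: let tagsoup deal with the markup.
-- All href values found on <a> tags.
linkTargets :: String -> [String]
linkTargets html =
  [fromAttrib "href" t | t <- parseTags html, isTagOpenName "a" t]

-- Every piece of text between tags, in document order.
textChunks :: String -> [String]
textChunks html = [txt | TagText txt <- parseTags html]

-- Step 2: split the plain text with whatever technique you like.
-- Here a simple "key: value" split; a regex or a Parsec parser
-- could be dropped in at this point instead.
keyValue :: String -> Maybe (String, String)
keyValue s = case break (== ':') s of
  (k, ':' : v) -> Just (k, dropWhile (== ' ') v)
  _            -> Nothing

main :: IO ()
main = do
  print (linkTargets samplePage)                    -- ["/next"]
  print (mapMaybe keyValue (textChunks samplePage)) -- [("Widget","42 USD")]
```

The split in step 2 only ever sees text that has already been separated from
the markup, so it never has to know anything about HTML.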
...
> He's very experienced, and comes from
> a Perl perspective. I let him in on what I was doing, and he opined I
> should be using pcre.
When all you have is a hammer, everything looks like a thumb.
Structured manipulation of algebraic data types is trivial in Haskell,
and much less natural in Perl, so they use different techniques in
different places.
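
To make the point about algebraic data types concrete, here is a small sketch
that works directly on tagsoup's Tag type by pattern matching. The function
name textInside and the sample input are hypothetical; it assumes the text of
interest immediately follows the opening tag.

```haskell
import Text.HTML.TagSoup (Tag(..), parseTags)

-- Collect the text that directly follows each opening tag with the
-- given name, by matching on the Tag constructors.
textInside :: String -> [Tag String] -> [String]
textInside name = go
  where
    go (TagOpen n _ : TagText txt : rest)
      | n == name = txt : go rest
    go (_ : rest) = go rest
    go []         = []

main :: IO ()
main =
  -- Made-up input for illustration.
  print (textInside "td" (parseTags "<tr><td>alpha</td><td>beta</td></tr>"))
  -- ["alpha","beta"]
```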
...
> So now I'm second-guessing my choices. Why do
> people choose not to use regex for URI parsing?
If you mean HTML parsing, then it's because it's a nightmare to get
right, and people on the web do all kinds of crazy stuff. A correct
regular expression to match an HTML tag is lots of work. Given that
it's a solved problem, why go to all that effort? It is possible to do
with regular expressions, but not pleasant.
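
One illustration of why the obvious pattern goes wrong: the sketch below
hand-codes the equivalent of the naive regex <[^>]*> (using only base, as a
stand-in for a real regex library) and runs it on a tag whose attribute value
contains a '>'. The names naiveFirstTag and tricky are made up for this
example.

```haskell
import Text.HTML.TagSoup (fromAttrib, isTagOpenName, parseTags)

-- Made-up input: perfectly legal HTML with '>' inside a quoted attribute.
tricky :: String
tricky = "<a title=\"a > b\" href=\"/x\">link</a>"

-- Stand-in for the naive regex <[^>]*> : take everything from the
-- first '<' up to and including the next '>'.
naiveFirstTag :: String -> Maybe String
naiveFirstTag s = case dropWhile (/= '<') s of
  []   -> Nothing
  rest -> case break (== '>') rest of
    (body, '>' : _) -> Just (body ++ ">")
    _               -> Nothing

-- The same question answered with tagsoup's tokenizer.
hrefs :: String -> [String]
hrefs html = [fromAttrib "href" t | t <- parseTags html, isTagOpenName "a" t]

main :: IO ()
main = do
  print (naiveFirstTag tricky) -- cut off at the '>' inside the title value
  print (hrefs tricky)
```

The naive match stops inside the title attribute and never reaches href, and
that is before comments, unquoted attributes or unclosed tags enter the
picture.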

Thanks, Neil

Neil Mitchell