
On 13 November 2010 16:46, Neil Mitchell wrote:
I've been working on a project that requires me to do screen scraping.
If you are screen scraping HTML, I think tagsoup is a very good choice. With tagsoup you have a real HTML 5 compliant parser underneath, and you can then use whatever technique you wish to split up the page text; regular expressions or Parsec might be a reasonable choice. I've written lots of screen scraping code with tagsoup, and it's usually very easy. The manual even walks you through a couple of examples: http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm
Agreed, the tagsoup library just works. I've used it plenty of times for my scraping needs. E.g. scraping from paste sites:
https://github.com/chrisdone/amelie/blob/master/src/Amelie/Import.hs#L84
https://github.com/chrisdone/hpaste-feed/blob/master/main.hs#L65
You can always regex match on what tagsoup gives you, too.
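
For anyone skimming the thread later, here is a minimal sketch of the approach described above: parse the page with tagsoup, then pick out what you need from the resulting tag list. It is not taken from the manual or from the linked repositories. The names samplePage, linkTargets and linkTexts and the sample HTML are made up for illustration, and it assumes the tagsoup package is installed.

```haskell
module Scrape where

import Text.HTML.TagSoup
  ( fromAttrib, innerText, isTagCloseName, isTagOpenName
  , parseTags, partitions )

-- Hypothetical page content; in practice this would be the fetched HTML.
samplePage :: String
samplePage =
  "<html><body><a href=\"/paste/1\">First</a> <p>x</p>"
    ++ "<a href='/paste/2'>Second</a></body></html>"

-- The href attribute of every opening <a> tag, in document order.
-- fromAttrib yields "" when the attribute is absent.
linkTargets :: String -> [String]
linkTargets html =
  [ fromAttrib "href" t | t <- parseTags html, isTagOpenName "a" t ]

-- The text inside each <a>...</a>: split the tag list at every opening
-- <a>, keep the tags up to the matching close, and flatten the text.
linkTexts :: String -> [String]
linkTexts html =
  map (innerText . takeWhile (not . isTagCloseName "a"))
      (partitions (isTagOpenName "a") (parseTags html))
```

Load it in GHCi and try linkTargets samplePage and linkTexts samplePage. The strings that come back are ordinary Haskell Strings, so a regex library or Parsec can be applied to them afterwards.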