
Hello,

I would recommend using TagSoup:

http://www-users.cs.york.ac.uk/~ndm/tagsoup/

The tutorial is easy to follow and has good advice:

http://www.cs.york.ac.uk/fp/darcs/tagsoup/tagsoup.htm

I would not bother trying to use a real XML parser, because I suspect that many of the XHTML pages you want to parse are not actually valid XHTML, which means the XML parsers will fail. Also, some of the sites you are interested in might not be XHTML at all. So using TagSoup for everything seems simplest.

The process is very lo-fi. Write some code using TagSoup which scrapes the data you care about from the web pages and turns it into Haskell data structures. This code should not be clever, and it will need to be updated whenever the site you are scraping changes enough to break your code.

This process should work fine if you are talking about scraping data from some specific sites. If you want to make a web crawler which automatically finds relevant pages and scrapes the data, then that is a much bigger project. You will still want to use something like TagSoup to do the initial parsing, but extracting the data will be much trickier (though possibly worth billions of $$$ if done well).

j.

ps. I only have experience with TagSoup, so there may be other libraries which are even better. The key feature of TagSoup is that it allows you to process malformed, invalid HTML -- which is important if you don't control the creation of the HTML you are parsing.

At Sat, 02 Aug 2008 22:10:36 -0300, Rafael C. de Almeida wrote:
Hello,
I understand that nowadays there are several frameworks and wrapper libraries for making some sense of the XHTML documents you find on the web. That is, for making life easier for those who want to process the semi-structured data found on those sites.
I don't have much experience in that field myself, but I want to learn a little more about how I can, for instance, associate information from one site with information on another site, even though it is structured differently in both places. Does anyone know about libraries that would help me out with that sort of work? I hope I'm being clear.
[]'s
Rafael
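
For illustration, the lo-fi approach suggested in the reply (parse with TagSoup, pull out the bits you care about, put them in a Haskell data structure) might look roughly like the sketch below. It is only a sketch under assumptions: the `Link` record, the `extractLinks` function, the `sample` input and the example.org URLs are made up for this example and do not come from the thread. The only library functions used are `parseTags`, `sections`, `isTagOpenName`, `isTagCloseName`, `fromAttrib` and `innerText` from `Text.HTML.TagSoup`.

```haskell
import Text.HTML.TagSoup

-- | Hypothetical record for the data we care about on a page.
data Link = Link
  { linkHref :: String  -- value of the href attribute
  , linkText :: String  -- visible text inside the anchor
  } deriving (Show, Eq)

-- | Collect every <a> tag that has a non-empty href, with its text.
-- Deliberately not clever: it walks the flat tag list and never
-- requires the markup to be well formed.
extractLinks :: String -> [Link]
extractLinks html =
  [ Link href (tidy (innerText body))
  | (open : rest) <- sections (isTagOpenName "a") (parseTags html)
  , let href = fromAttrib "href" open
  , not (null href)
    -- Text runs up to the next </a>. If the anchor is never closed,
    -- this takes everything to the end of the document.
  , let body = takeWhile (not . isTagCloseName "a") rest
  ]
  where
    -- Collapse runs of whitespace and trim both ends.
    tidy = unwords . words

-- | Made-up, intentionally invalid input: <b> is never closed and
-- the second <a> and the <p> are left open.
sample :: String
sample =
  "<p>See <a href=\"http://example.org/a\">First <b>link</a> and "
    ++ "<a href=\"/b\">second"
```

In GHCi, `extractLinks sample` should return one `Link` per anchor despite the broken markup. To associate data from two sites, one option is to write one such extractor per site, have both produce the same record type, and then join the results on a shared key (for example with `Data.Map`).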