
Hello.
I am porting to Haskell a Java application I wrote to manage collections of movies.
Currently the application has an option to import movie data from web pages indirectly. The user first opens the page in a web browser, then copies the rendered text from the browser into an import window in my application and clicks an "import" button. In response, the application parses the given text with regular expressions and collects any relevant data it recognizes.
For instance, to get the director of a movie from the AllCenter web site I use the following regular expression:
^Direção:\s+(.+)$
I want to change this scheme to eliminate the need to copy the rendered text from a web browser. Instead, my application should download and parse the HTML page directly.
Which libraries are available in Haskell that would make it easy to get content information from an HTML document, in the way described above?
Regards,
Romildo

Hello José,
I did a similar task a few weeks ago and used the Haskell XML Toolbox (hxt) [1] for it. After learning how to program with arrows, it was quite easy to write arrows that extract the relevant information from XML data.
Regards,
Martin.
[1] http://hackage.haskell.org/package/hxt
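A rough sketch of what the hxt approach might look like for the "Direção:" example. It assumes the hxt package is installed and that, in the page, the label and its value end up on one line of text; the real AllCenter markup is not shown in this thread, so that layout is a guess. The helper names (pageText, fieldValue, directorFromHtml) are made up for illustration.

```haskell
import Control.Arrow ((>>>))
import Data.Char (isSpace)
import Data.List (dropWhileEnd, stripPrefix)
import Data.Maybe (listToMaybe)
import Text.XML.HXT.Core (deep, getText, readString, runX, withParseHTML, withWarnings)

-- Run hxt's lenient HTML parser on a string and concatenate every
-- text node of the document, in document order.
pageText :: String -> IO String
pageText html =
  concat <$> runX (readString [withParseHTML True, withWarnings False] html
                   >>> deep getText)

-- Drop leading and trailing whitespace.
trimSpaces :: String -> String
trimSpaces = dropWhileEnd isSpace . dropWhile isSpace

-- Value of the first line that starts with the given label.
-- This only approximates the regular expression ^Direção:\s+(.+)$
fieldValue :: String -> String -> Maybe String
fieldValue label text =
  listToMaybe
    [ trimSpaces rest
    | line <- lines text
    , Just rest <- [stripPrefix label (trimSpaces line)]
    , not (all isSpace rest)
    ]

-- Hypothetical helper: look up the director in an HTML page.
-- Assumption: the label is spelled exactly "Direção:" in the page text.
directorFromHtml :: String -> IO (Maybe String)
directorFromHtml html = fieldValue "Direção:" <$> pageText html
```

Once the page structure is known, the `deep getText` part could be narrowed to the specific element that holds the field, which is where the arrow combinators become more useful than a plain text scan.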

To parse HTML documents, I've had success with TagSoup in the past. You can take a look at the HTTP package to download the HTML from the server. Both packages are available from Hackage.
HTH,
Jochem
--
Jochem Berndsen | jochem@functor.nl
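A minimal sketch of how TagSoup and the HTTP package might be combined for the "Direção:" case. It assumes both packages are installed, that the page is served over plain HTTP, and that the label and value land on the same line once the tags are stripped; the real page layout is not shown in this thread. The function names (fieldValue, extractDirector, fetchDirector) are made up for illustration.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, stripPrefix)
import Data.Maybe (listToMaybe)
import Network.HTTP (getRequest, getResponseBody, simpleHTTP)
import Text.HTML.TagSoup (innerText, parseTags)

-- Drop leading and trailing whitespace.
trimSpaces :: String -> String
trimSpaces = dropWhileEnd isSpace . dropWhile isSpace

-- Value of the first line that starts with the given label.
-- This only approximates the regular expression ^Direção:\s+(.+)$
fieldValue :: String -> String -> Maybe String
fieldValue label text =
  listToMaybe
    [ trimSpaces rest
    | line <- lines text
    , Just rest <- [stripPrefix label (trimSpaces line)]
    , not (all isSpace rest)
    ]

-- Strip the markup with TagSoup, then scan the remaining text,
-- much like the existing copy-and-paste import does.
extractDirector :: String -> Maybe String
extractDirector = fieldValue "Direção:" . innerText . parseTags

-- Download a page and look for the director.
-- Caveat (to my knowledge): the HTTP package returns the body without
-- charset decoding, so a UTF-8 page may need decoding before the
-- accented label can match.
fetchDirector :: String -> IO (Maybe String)
fetchDirector url = do
  body <- simpleHTTP (getRequest url) >>= getResponseBody
  return (extractDirector body)
```

Keeping extractDirector pure means the parsing can be tried on saved pages without touching the network, and the same fieldValue helper can be reused for other labels.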

Hello,
I would use TagSoup: http://community.haskell.org/~ndm/tagsoup/
It was designed for exactly this type of thing.
- jeremy
participants (4): Jeremy Shaw, Jochem Berndsen, José Romildo Malaquias, Martin Huschenbett