
I'd like to write a very simple Haskell script that when given a URL, looks up the page, and returns a string of HTML. I don't see an HTTP library in the standard libs, and the one in Hackage requires Windows machines have GHC and MinGW to be installed and in the PATH. Is there a simple way to get the contents of a webpage using Haskell on a Windows box? Thanks, Greg

I'd like to write a very simple Haskell script that when given a URL, looks up the page, and returns a string of HTML. I don't see an HTTP library in the standard libs, and the one in Hackage requires Windows machines have GHC and MinGW to be installed and in the PATH.
Is there a simple way to get the contents of a webpage using Haskell on a Windows box?
This isn't exactly what you want, but it gets you partway there. Not sure if LineBuffering or NoBuffering is the best option. Line buffering should be fine for just text output, but if you request a binary object (like an image) then you have to read exactly the number of bytes specified, and no more.

Alistair

module Main where

import System.IO
import Network

main = client "www.haskell.org" 80 "/haskellwiki/Haskell"

client server port page = do
    h <- connectTo server (PortNumber port)
    hSetBuffering h NoBuffering
    putStrLn "send request"
    hPutStrLn h ("GET " ++ page ++ "\r")
    hPutStrLn h "\r"
    hPutStrLn h "\r"
    putStrLn "wait for response"
    readResponse h
    putStrLn ""

readResponse h = do
    closed <- hIsClosed h
    eof <- hIsEOF h
    if closed || eof
        then return ()
        else do
            c <- hGetChar h
            putChar c
            readResponse h

Hi,
I've often wondered the same as the above poster. Something like
readWebPage (in the same style as readFile) would be a really handy
function. Do no libraries provide this?
(if not, can one start providing it? MissingH?)
Thanks
Neil
On 1/18/07, Alistair Bayley wrote:
I'd like to write a very simple Haskell script that when given a URL, looks up the page, and returns a string of HTML. I don't see an HTTP library in the standard libs, and the one in Hackage requires Windows machines have GHC and MinGW to be installed and in the PATH.
Is there a simple way to get the contents of a webpage using Haskell on a Windows box?
This isn't exactly what you want, but it gets you partway there. Not sure if LineBuffering or NoBuffering is the best option. Line buffering should be fine for just text output, but if you request a binary object (like an image) then you have to read exactly the number of bytes specified, and no more.
Alistair

Alistair Bayley wrote:
I'd like to write a very simple Haskell script that when given a URL, looks up the page, and returns a string of HTML. I don't see an HTTP library in the standard libs...
Neil Mitchell wrote:
MissingH?
MissingPy. It would be great to have a full-featured native library that handles all of the subtleties - redirects, cookies, etc. But it could be that the most robust solution currently is still the following (using -package MissingPy):

import Python.Interpreter
import Python.Objects

main = do
    py_initialize
    pyImport "urllib2"
    ...

getHtml :: String -> IO String
getHtml url = do
    urlObj <- toPyObject url
    handle <- callByName "urllib2.urlopen" [urlObj] []
    r <- getattr handle "read"
    pyObject_Call r [] [] >>= strOf

Regards,
Yitz

Hi Alistair,
Is there a simple way to get the contents of a webpage using Haskell on a Windows box?
This isn't exactly what you want, but it gets you partway there. Not sure if LineBuffering or NoBuffering is the best option. Line buffering should be fine for just text output, but if you request a binary object (like an image) then you have to read exactly the number of bytes specified, and no more.
This works great for haskell.org, unfortunately it doesn't work as well with the rest of the web universe.

With www.google.com I get:

    Program error: <handle>: IO.hGetChar: illegal operation

With www.slashdot.org I get:

    501 Not Implemented returned

www.msnbc.msn.com works fine. Any ideas why?

Are there any alternatives to read in a file off the internet (i.e. wget but as a library)?

Thanks
Neil

On Sunday 28 January 2007 09:14, Neil Mitchell wrote:
Hi Alistair,
Is there a simple way to get the contents of a webpage using Haskell on a Windows box?
This isn't exactly what you want, but it gets you partway there. Not sure if LineBuffering or NoBuffering is the best option. Line buffering should be fine for just text output, but if you request a binary object (like an image) then you have to read exactly the number of bytes specified, and no more.
This works great for haskell.org, unfortunately it doesn't work as well with the rest of the web universe.
With www.google.com I get: Program error: <handle>: IO.hGetChar: illegal operation
With www.slashdot.org I get: 501 Not Implemented returned
www.msnbc.msn.com works fine.
Any ideas why?
At the very least it's missing the HTTP version on the request line, and you almost always need to send a Host header. For a start you could try changing client to:

client server port page = do
    h <- connectTo server (PortNumber port)
    hSetBuffering h NoBuffering
    putStrLn "send request"
    hPutStrLn h ("GET " ++ page ++ " HTTP/1.1\r")
    hPutStrLn h ("Host: " ++ server ++ "\r")
    hPutStrLn h "\r"
    hPutStrLn h "\r"
    putStrLn "wait for response"
    readResponse h
    putStrLn ""

Note that I haven't tried this, or the rest of Alistair's code, at all, so the usual 30 day money back guarantee doesn't apply. It certainly won't handle redirects.
Are there any alternatives to read in a file off the internet (i.e. wget but as a library)
The http library sort of works most of the time, but there are several bugs that cause it to fail on many 'in the wild' webservers.

HXT has a wrapper around a command line invocation of cURL. It works better. There is still a problem with redirects, but that's an easy enough fix. I doubt that it would be very easy to extract it from the surrounding HXT framework though.

It would be nice to have a binding to libcurl.

Daniel
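As a rough illustration of the "cURL as a library" idea, here is a minimal sketch that simply shells out to the curl binary. It assumes curl is on the PATH, uses readProcessWithExitCode from a later version of the process package than was current at the time, and the flag choices are assumptions rather than anything stated in this thread:

import System.Exit (ExitCode(..))
import System.Process (readProcessWithExitCode)

-- Fetch a URL by invoking the external curl binary.
-- "-s" silences the progress meter, "-L" follows redirects, and
-- "-f" makes curl fail on HTTP errors instead of printing the error page.
fetchViaCurl :: String -> IO (Either String String)
fetchViaCurl url = do
    (code, out, err) <- readProcessWithExitCode "curl" ["-s", "-L", "-f", url] ""
    return $ case code of
        ExitSuccess   -> Right out
        ExitFailure n -> Left ("curl exited with code " ++ show n ++ ": " ++ err)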

Hi Daniel
Note that I haven't tried this, or the rest of Alistair's code, at all, so the usual 30 day money back guarantee doesn't apply. It certainly won't handle redirects.
Thanks, it certainly gets more things, but has a nasty habit of taking a very long time in Hugs on certain URLs:

research.microsoft.com/
www.cs.york.ac.uk/

And on some urls, ie http://research.microsoft.com/~simonpj/, it still ends up with IO.hGetChar: illegal operation.

Any ideas why?

Thanks
Neil

On Sunday 28 January 2007 10:53, Neil Mitchell wrote:
Thanks, it certainly gets more things, but has a nasty habit of taking a very long time in Hugs on certain URLs:
research.microsoft.com/,
Looks like IIS is waiting until it receives a Connection header, a bit of a variation from spec I think... Adding in

    hPutStrLn h ("Connection: close\r\n")

or

    hPutStrLn h ("Connection: keep-alive\r\n")

as appropriate should sort that.
www.cs.york.ac.uk/
This is responding with a 302: the resource has been found but is temporarily at another location, indicated in the response's Location header. So, now you'll have to start parsing responses. In this case the Location header is www.cs.york.ac.uk/public.php
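To make that concrete, here is a rough sketch of pulling the status code and the Location header out of a raw response. The helper names (parseStatus, findLocation) are made up for illustration, not anything from an existing library:

import Data.Char (isDigit, toLower)
import Data.List (isPrefixOf)

-- Extract the numeric status code from a status line such as
-- "HTTP/1.1 302 Found".
parseStatus :: String -> Maybe Int
parseStatus line =
    case words line of
        (_:code:_) | not (null code) && all isDigit code -> Just (read code)
        _                                                -> Nothing

-- Find a Location header among the header lines (everything before the
-- first blank line of the response), ignoring the case of the header name.
findLocation :: [String] -> Maybe String
findLocation headerLines =
    case [ dropWhile (== ' ') (drop 1 (dropWhile (/= ':') h))
         | h <- headerLines
         , "location:" `isPrefixOf` map toLower h ] of
        (url:_) -> Just url
        []      -> Nothing

A caller would check for a 3xx status and, if a Location is present, re-issue the GET against it - which is essentially what a full client library automates.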
And on some urls, ie http://research.microsoft.com/~simonpj/, it still ends up with IO.hGetChar: illegal operation
Any ideas why?
Hmmm, if you putStrLn the values of closed and eof it looks to be hanging during the eof check. Don't know why.

Oh yeah, all the carriage-returns should be carriage-return line-feeds, from memory. Not that that seems to help with this problem.

The cheap and cheerful solution might be to invoke cURL.

Daniel
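For what it's worth, writing the request with explicit CRLF terminators (hPutStr rather than hPutStrLn) would look something like the sketch below. sendRequest is a hypothetical helper, and the header set is just what has come up earlier in the thread:

import System.IO (Handle, hFlush, hPutStr)

-- Write a minimal GET request using explicit "\r\n" line endings rather
-- than relying on hPutStrLn, which only appends '\n'.
sendRequest :: Handle -> String -> String -> IO ()
sendRequest h server page = do
    hPutStr h ("GET " ++ page ++ " HTTP/1.1\r\n")
    hPutStr h ("Host: " ++ server ++ "\r\n")
    hPutStr h "Connection: close\r\n"
    hPutStr h "\r\n"
    hFlush h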

Daniel McAllansmith wrote:
The cheap and cheerful solution might be to invoke cURL.
Or MissingPy.

The bottom line is that URL loading is not the same as HTTP. It is higher level. While Haskell does have a nice HTTP library, it does not have a URL loading library yet as far as I can see from this thread.

-Yitz

Hi Daniel,
Adding in hPutStrLn h ("Connection: close\r\n") or hPutStrLn h ("Connection: keep-alive\r\n") as appropriate should sort that.
Works like a charm.
This is responding with a 302, the resource has been found but is temporarily at another location indicated in the responses Location header. So, now you'll have to start parsing responses. In this case the Location header is www.cs.york.ac.uk/public.php
I didn't get as far as getting the 302, but it works now.
The cheap and cheerful solution might be to invoke cURL.
My standard solution was to invoke wget, but a Haskell solution would be nicer. For my purpose following redirects etc. isn't required, so thanks very much for your help.

I will be releasing this function as part of a library shortly, so will be giving you credit for your help!

Thanks
Neil

Neil Mitchell wrote:
My standard solution was to invoke wget, but a Haskell solution would be nicer. For my purpose following redirects etc. isn't required, so thanks very much for your help. I will be releasing this function as part of a library shortly, so will be giving you credit for your help!
Neil
Good god, no! The code was merely meant to illustrate how a really basic HTTP GET might work. It certainly doesn't deal with a lot of the additional cases, like redirects and resource moves, and non-standards-compliant HTTP servers. I'm no HTTP expert, so for all I know this example code is likely non-standards-compliant too. It really only works for a very straightforward text file GET, and there's no exception or error handling, or any of the more interesting cases.

You may or may not know that there are a large number of webserver implementations which do not respect the HTTP standards (1.0 or 1.1), and HTTP clients (like web browsers) have to go to some lengths in order to get sensible responses out of most of them. So pure Haskell code (i.e. no FFI calls to a C HTTP lib) should probably consider a large subset of these non-standard behaviours, too.

OTOH, if your needs really are very simple, then fine. But be aware that "doing the right thing" with real-world HTTP responses can be a can-o'-worms.

Alistair

Neil Mitchell wrote:
I will be releasing this function as part of a library shortly
Alistair Bayley wrote:
no! The code was merely meant to illustrate how a really basic HTTP GET might work. It certainly doesn't deal with a lot of the additional cases, like redirects and resource moves, and non-standards-compliant HTTP servers... there are a large number of webserver implementations which do not respect the HTTP standards (1.0 or 1.1), and HTTP clients (like web browsers) have to go to some lengths in order to get sensible responses out of most of them... if your needs really are very simple, then fine. But be aware that "doing the right thing" with real-world HTTP responses can be a can-o'-worms.
Let's not complicate things too much at the HTTP level. Low-level HTTP is a simple protocol, not hard to implement. You send a request with headers and data, and get a like response. Possibly reuse the connection. That's it. HTTP is useful for many things besides just loading web pages from large public servers. We need a simple, easy to use module that just implements HTTP. I think we have that, or we are close.

Loading URLs on the web is an entirely different matter. There is a whole layer of logic that is needed to deal with the mess out there. It builds not just on HTTP, but on various other standard and non-standard protocols.

URL loading is a hard problem, but usable solutions are well-known and available. I would suggest that we not re-invent the wheel here. If we want a pure Haskell solution - and that would be nice - we should start with an existing code base that is widely used, stable, and not too messy. Then re-write it in Haskell. Otherwise, just keep spawning wget or cURL, or use MissingPy.

But please don't confuse concerns by mixing URL-loading logic into the HTTP library. They made that mistake in Perl in the early days of the web, before it was clear what was about to happen. There is no reason for us to repeat the mistake.

Thanks,
Yitz

On Jan 29, 2007, at 11:11, Yitzchak Gale wrote:
Neil Mitchell wrote:
I will be releasing this function as part of a library shortly
Alistair Bayley wrote:
no! The code was merely meant to illustrate how a really basic HTTP GET might work. It certainly doesn't deal with a lot of the additional cases, like redirects and resource moves, and non-standards-compliant HTTP servers... there are a large number of webserver implementations which do not respect the HTTP standards (1.0 or 1.1), and HTTP clients (like web browsers) have to go to some lengths in order to get sensible responses out of most of them... if your needs really are very simple, then fine. But be aware that "doing the right thing" with real-world HTTP responses can be a can-o'-worms.
Let's not complicate things too much at the HTTP level. Low-level HTTP is a simple protocol, not hard to implement. You send a request with headers and data, and get a like response. Possibly reuse the connection. That's it. HTTP is useful for many things besides just loading web pages from large public servers. We need a simple, easy to use module that just implements HTTP. I think we have that, or we are close.
Loading URLs on the web is an entirely different matter. There is a whole layer of logic that is needed to deal with the mess out there. It builds not just on HTTP, but on various other standard and non-standard protocols.
URL loading is a hard problem, but usable solutions are well-known and available. I would suggest that we not re-invent the wheel here. If we want a pure Haskell solution - and that would be nice - we should start with an existing code base that is widely used, stable, and not too messy. Then re-write it in Haskell. Otherwise, just keep spawning wget or cUrl, or use MissingPy.
But please don't confuse concerns by mixing URL-loading logic into the HTTP library.
They made that mistake in Perl in the early days of the web, before it was clear what was about to happen. There is no reason for us to repeat the mistake.
Status report for the HTTP package (http://haskell.org/http/):

The Network.HTTP module is an implementation of HTTP itself. The Network.Browser module sits on top of that and does more high-level things, such as cookie handling.

I maintain the current HTTP package [1], but I haven't really done much maintenance, and I have only gotten a few patches submitted. Much of the code hasn't even been touched since Warrick Gray disappeared around 2002. The reason for this state of affairs is that I hardly use the library myself, and few others have contributed to it. In fact, I just now went to have a look at the code and noticed that until now, the most important functions in Network.Browser did not show up in the Haddock documentation because of missing type signatures.

This library needs a more dedicated maintainer and more contributors. Do we have any candidates in this thread? Here's a list of TODO items off the top of my head to get you started:

- Add a layer (on top of Network.Browser?) for simple get and post requests, with an interface something like:
    get :: URI -> IO String
    post :: URI -> String -> IO String
- Switch to use lazy ByteStrings
- Better API for Network.Browser?
- Move HTTP authentication stuff to a separate module?
- Move cookie stuff to a separate module? Unify with the similar code in the cgi package (Network.CGI.HTTP.Cookie)?
- Use MD5 and Base64 from Dominic's new nimbler crypto package (see http://www.haskell.org/haskellwiki/Crypto_Library_Proposal)
- Use the non-deprecated Network.URI API.
- Implement HTTPS support.

/Björn
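For a sense of what that first TODO item might look like, here is a sketch written against the simpleHTTP interface, using String URLs rather than Network.URI for brevity. The getRequest, postRequestWithBody and getResponseBody helpers come from later releases of the HTTP package, so treat this as an assumption about the eventual API rather than a description of the version discussed here:

import Network.HTTP (getRequest, getResponseBody, postRequestWithBody, simpleHTTP)

-- Fetch the body of a page with a plain GET.
get :: String -> IO String
get url = simpleHTTP (getRequest url) >>= getResponseBody

-- POST a form-encoded body and return the response body.
post :: String -> String -> IO String
post url body =
    simpleHTTP (postRequestWithBody url "application/x-www-form-urlencoded" body)
        >>= getResponseBody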

I've fallen off the pace on this thread so this is a composite reply, mainly to Bjorn, Brad and Yitzchak...

I would also like to express my gratitude for the work that Bjorn, and all the others involved, have done on the http library. I certainly appreciated having it available for use.

I agree that a full-featured HTTP library is important for Haskell. And resource loading and serving as separate concerns on top of this. The HTTP protocol is reasonably straightforward, and pretty well specified, so standards compliance should be achievable. But to actually be useful in a lot of situations standards compliance is insufficient; many HTTP applications seem to be pretty poor efforts at compliance, and being able to handle quirks is a necessity. I think any library also needs to be robust in the face of malicious input. One of the things that makes HTTP useful is that it is extendable. Any library should expose this in a principled manner.

A wrapper around, say, cURL or a binding to libcurl is not a great solution in my opinion. It would be cheap and provides more functionality than the current Haskell http library, but lacks the separation of protocol and processing and lacks the extension aspects of HTTP. And obviously it has foreign dependencies. So I'd really like to see a pure Haskell option. Unfortunately I don't agree that we are close to one now.

I'm not enthused about http being the basis for 'the feature-complete' library. I'd like a library that at least has more static checks, is open to extension according to the protocol, allows subversion of standards compliance for case-by-case quirk handling, and can be tuned to handle malicious input. I would also like to contribute, and could enumerate my (half-baked) ideas and opinions if they are of interest.

Cheers
Daniel

Hi
My standard solution was to invoke wget, but a Haskell solution would be nicer. For my purpose following redirects etc. isn't required, so thanks very much for your help. I will be releasing this function as part of a library shortly, so will be giving you credit for your help!
Good god, no! The code was merely meant to illustrate how a really basic HTTP GET might work. It certainly doesn't deal with a lot of the additional cases, like redirects and resource moves, and non-standards-compliant HTTP servers. I'm no HTTP expert, so for all I know this example code is likely non-standards-compliant too. It really only works for a very straightforward text file GET, and there's no exception or error handling, or any of the more interesting cases.
But at the same time it's the best solution available without going to too much pain and C libraries. I would love something more robust and equally simple, but haven't found it.

The library I am writing is one to parse and extract information from HTML documents; the fact that I can now download web pages merely makes the examples I give more interesting - it's not the fundamental essence of the library.

Thanks
Neil

Greg Fitzgerald wrote:
I'd like to write a very simple Haskell script that when given a URL, looks up the page, and returns a string of HTML. I don't see an HTTP library in the standard libs, and the one in Hackage requires Windows machines have GHC and MinGW to be installed and in the PATH.
Is there a simple way to get the contents of a webpage using Haskell on a Windows box?
I agree with other posters that the Network.HTTP API should be made more easy to use. I will happily accept patches for this.

The HTTP package homepage (http://www.haskell.org/http/) mentioned the GHC and MinGW requirements you cite, but those seemed to be out of date. You should be able to use plain Cabal to install the HTTP package for any recent GHC or Hugs.

/Björn
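For anyone unfamiliar with the plain Cabal route, the usual steps from the unpacked package directory look roughly like this, assuming GHC (the setup file may be Setup.hs or Setup.lhs depending on the package):

    runhaskell Setup.hs configure
    runhaskell Setup.hs build
    runhaskell Setup.hs install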

Alistair, Neil, Brad, Yitzchak, Bjorn,
Thanks all for your help.
-Greg
On 1/19/07, Björn Bringert wrote:
Greg Fitzgerald wrote:
I'd like to write a very simple Haskell script that when given a URL, looks up the page, and returns a string of HTML. I don't see an HTTP library in the standard libs, and the one in Hackage requires Windows machines have GHC and MinGW to be installed and in the PATH.
Is there a simple way to get the contents of a webpage using Haskell on a Windows box?
I agree with other posters that the Network.HTTP API should be made more easy to use. I will happily accept patches for this.
The HTTP package homepage (http://www.haskell.org/http/) mentioned the GHC and MinGW requirements you cite, but those seemed to be out of date. You should be able to use plain Cabal to install the HTTP package for any recent GHC or Hugs.
/Björn
participants (8)
- Alistair Bayley
- Bayley, Alistair
- Bjorn Bringert
- Björn Bringert
- Daniel McAllansmith
- Greg Fitzgerald
- Neil Mitchell
- Yitzchak Gale