Downloading web page in Haskell

In order to download a given web page, I wrote the attached program. The problem is that the page is not being fully downloaded. It is being somehow interrupted. Any clues on how to solve this problem?
Romildo
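The attached program itself is not included in this archive. Judging from the rest of the thread (it is later described as a simpleHTTP (getRequest url) call from Network.HTTP), a minimal downloader along those lines would look roughly like this sketch:

module Main where

import Network.HTTP (simpleHTTP, getRequest, getResponseBody)

main :: IO ()
main = do
  let url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
  -- simpleHTTP sends a plain GET with no Accept-Encoding header,
  -- which, as the thread goes on to discover, is what triggers the truncation.
  rsp  <- simpleHTTP (getRequest url)
  body <- getResponseBody rsp
  writeFile "test.html" body
  putStrLn body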

2010/11/20 José Romildo Malaquias
In order to download a given web page, I wrote the attached program. The problem is that the page is not being fully downloaded. It is being somehow interrupted.
Any clues on how to solve this problem?
My guess is that there's a character encoding issue. Another approach would be using the http-enumerator package [1]. The equivalent program is:

module Main where

import Network.HTTP.Enumerator (simpleHttp)
import qualified Data.ByteString.Lazy as L

main = do
  src <- simpleHttp "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
  L.writeFile "test.html" src
  L.putStrLn src

Michael

[1] http://hackage.haskell.org/package/http-enumerator

michael:
2010/11/20 José Romildo Malaquias
In order to download a given web page, I wrote the attached program. The problem is that the page is not being fully downloaded. It is being somehow interrupted.
Any clues on how to solve this problem?
My guess is that there's a character encoding issue. Another approach would be using the http-enumerator package[1]. The equivalent program is:
module Main where
import Network.HTTP.Enumerator (simpleHttp)
import qualified Data.ByteString.Lazy as L

main = do
  src <- simpleHttp "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
  L.writeFile "test.html" src
  L.putStrLn src
FWIW, with this URL, I get the same problem using the Curl package (via download-curl):

import Network.Curl.Download
import qualified Data.ByteString as B

main = do
  edoc <- openURI "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
  case edoc of
    Left err  -> print err
    Right doc -> B.writeFile "test.html" doc

Not a problem on e.g. http://haskell.org

-- Don

On Saturday 20 November 2010 21:47:52, Don Stewart wrote:
2010/11/20 José Romildo Malaquias
In order to download a given web page, I wrote the attached program. The problem is that the page is not being fully downloaded. It is being somehow interrupted.
Any clues on how to solve this problem?
FWIW, with this url, I get the same problem using the Curl package
Just for the record, wget also gets a truncated (at the same point) file, so it's not a Haskell problem.
Not a problem on e.g. http://haskell.org
-- Don

On Sat, Nov 20, 2010 at 10:26:49PM +0100, Daniel Fischer wrote:
On Saturday 20 November 2010 21:47:52, Don Stewart wrote:
2010/11/20 José Romildo Malaquias
In order to download a given web page, I wrote the attached program. The problem is that the page is not being fully downloaded. It is being somehow interrupted.
Any clues on how to solve this problem?
FWIW, with this url, I get the same problem using the Curl package
Just for the record, wget also gets a truncated (at the same point) file, so it's not a Haskell problem.
Web browsers like Firefox and Opera do not seem to have the same problem with this web page. I would like to be able to download this page from Haskell.
Romildo

José Romildo Malaquias wrote:
Web browsers like Firefox and Opera do not seem to have the same problem with this web page. I would like to be able to download this page from Haskell.
Hi Romildo,

This web page serves the head, including a lot of JavaScript, and the first few hundred bytes of the body, then pauses. That causes web browsers to begin loading and executing the JavaScript. Apparently, the site only continues serving the rest of the page if the JavaScript is actually loaded and executed. If not, it aborts.

Either intentionally or unintentionally, that effectively prevents naive scripts from accessing the page. Cute technique.

So if you don't want to honor the site author's intention not to allow scripts to load the page, try looking through the JavaScript and find out what you need to do to get the page to continue loading. However, if the site author is very determined to stop you, the JavaScript will be obfuscated or encrypted, which would make this an annoying task.

Good luck,
Yitz

On Nov 20, 2010, at 5:10 PM, Yitzchak Gale wrote:
José Romildo Malaquias wrote:
Web browsers like Firefox and Opera do not seem to have the same problem with this web page. I would like to be able to download this page from Haskell.
Hi Romildo,
This web page serves the head, including a lot of JavaScript, and the first few hundred bytes of the body, then pauses. That causes web browsers to begin loading and executing the JavaScript. Apparently, the site only continues serving the rest of the page if the JavaScript is actually loaded and executed. If not, it aborts.
Actually, I think it's just a misconfigured proxy. The curl executable fails, at the same point, but a curl --compressed call succeeds.

The curl bindings don't allow you to automatically get and decompress gzip data, so you could either set the Accept-Encoding: gzip header yourself and then pipe the output through the appropriate decompression routine, or, more simply, just get the page by using System.Process to drive the curl binary directly.

Cheers,
Sterl
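A rough sketch of that simpler route, assuming the curl executable is on the PATH (the file name and URL are just the ones used earlier in the thread):

module Main where

import System.Process (readProcess)

main :: IO ()
main = do
  let url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
  -- "-s" silences curl's progress meter; "--compressed" makes curl send an
  -- Accept-Encoding header and transparently decompress the response body.
  body <- readProcess "curl" ["-s", "--compressed", url] ""
  writeFile "test.html" body
  putStrLn body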

On 10-11-20 02:54 PM, José Romildo Malaquias wrote:
In order to download a given web page, I wrote the attached program. The problem is that the page is not being fully downloaded. It is being somehow interrupted.
The specific website and url
http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne
truncates when the web server chooses the identity encoding (i.e., as
opposed to compressed ones such as gzip). The server chooses identity
when your request's Accept-Encoding field specifies identity or simply
your request has no Accept-Encoding field, such as when you use
simpleHTTP (getRequest url), curl, wget, elinks.
When the server chooses gzip (its favourite), which is when your
Accept-Encoding field includes gzip, the received data is complete (but
then you have to gunzip it yourself). This happens with mainstream
browsers and W3C's validator at validator.w3.org (which destroys the
"you need javascript" hypothesis). I haven't tested other compressed
encodings.
Methodology
My methodology of discovering and confirming this is a great lesson in
the triumph of the scientific methodology (over the prevailing
opinionative methodology, for example).
The first step is to confirm or deny a Network.HTTP problem. For a
maximally controlled experiment, I enter HTTP by hand using nc:
$ nc www.adorocinema.com 80
GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
Host: www.adorocinema.com
<blank line>
It still truncates, so at least Network.HTTP is not alone. I also try
elinks. Other people try curl and wget for the same reason and the same
result.
The second step is to confirm or deny javascript magic. Actually the
truncation strongly suggests that javascript is not involved: the
truncation ends in the middle of an end-tag. This is abnormal even
for very buggy javascript-heavy web pages. To certainly deny javascript
magic, I first try Firefox with javascript off (also java off, flash
off, even css off), and then I also ask validator.w3.org to validate the
page. Both receive complete data. Of course the validator is going to
say "many errors", but the point is that if the validator reports errors
at locations way beyond our truncation point, then the validator sees
data we don't see, and the validator doesn't even care about javascript.
The validator may be very sophisticated in parsing html, but in sending
an HTTP request it ought to be very simple-minded.
The third step is to find out what extra thing the validator does to deserve complete data.
So I try diagonalization: I give this CGI script to the validator:
#! /bin/sh
echo 'Content-Type: text/html'
echo ''
e=`env`
# Emit the CGI environment (which includes the request headers the
# validator sends) as the page body:
cat <<EOF
<html><body><pre>$e</pre></body></html>
EOF

Albert Y. C. Lai wrote:
...truncates when the web server chooses the identity encoding The server chooses identity when your request's Accept-Encoding field specifies identity or simply your request has no Accept-Encoding field
Excellent work!
My methodology of discovering and confirming this is a great lesson in the triumph of the scientific methodology (over the prevailing opinionative methodology, for example).
Haha, indeed!
Actually the truncation strongly suggests that javascript is not involved: the truncation ends in the middle of an end-tag. This is abnormal even for very buggy javascript-heavy web pages.
Well, no, the theory was that the server sends some random number of bytes from the body to ensure that the browser starts loading the scripts in the head. So it could stop anywhere.

In the end, I think you didn't really need the W3C validator. You also could have triangulated on the headers sent by your own browser.

So, there you have it, folks. The Haskell community debugs a broken web server, without being asked, and without access to the server.

Most likely you also have the zlib package (cabal-install needs it), so let's use it. Attached: therefore.hs
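The attached therefore.hs is not reproduced here. A sketch of a program in that spirit, sending an explicit Accept-Encoding: gzip header with Network.HTTP and gunzipping the result with zlib's Codec.Compression.GZip (the exact structure of the attachment may differ), could be:

module Main where

import Network.HTTP (simpleHTTP, mkRequest, getResponseBody, insertHeader,
                     Request, RequestMethod (GET), HeaderName (HdrAcceptEncoding))
import Network.URI (parseURI)
import qualified Data.ByteString.Lazy as L
import qualified Codec.Compression.GZip as GZip
import Data.Maybe (fromJust)

main :: IO ()
main = do
  let url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
      uri = fromJust (parseURI url)
      -- Ask for gzip explicitly; with this header the server sends the
      -- complete (but compressed) page.
      req = insertHeader HdrAcceptEncoding "gzip"
              (mkRequest GET uri :: Request L.ByteString)
  rsp  <- simpleHTTP req
  body <- getResponseBody rsp
  -- The body arrives gzip-compressed, so decompress it before writing.
  L.writeFile "test.html" (GZip.decompress body)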
participants (7)
- Albert Y. C. Lai
- Daniel Fischer
- Don Stewart
- José Romildo Malaquias
- Michael Snoyman
- Sterling Clover
- Yitzchak Gale