Re: [Haskell-cafe] Downloading web page in Haskell

20 Nov 2010

      On 10-11-20 02:54 PM, José Romildo Malaquias wrote:
...
In order to download a given web page, I wrote the attached program. The
problem is that the page is not being full downloaded. It is being
somehow intettupted.
The specific website and url
http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne

truncates when the web server chooses the identity encoding (i.e., as 
opposed to compressed ones such as gzip). The server chooses identity 
when your request's Accept-Encoding field specifies identity or simply 
your request has no Accept-Encoding field, such as when you use 
simpleHTTP (getRequest url), curl, wget, elinks.

When the server chooses gzip (its favourite), which is when your 
Accept-Encoding field includes gzip, the received data is complete (but 
then you have to gunzip it yourself). This happens with mainstream 
browsers and W3C's validator at validator.w3.org (which destroys the 
"you need javascript" hypothesis). I haven't tested other compressed 
encodings.

Methodology

My methodology of discovering and confirming this is a great lesson in 
the triumph of the scientific methodology (over the prevailing 
opinionative methodology, for example).

The first step is to confirm or deny a Network.HTTP problem. For a 
maximally controlled experiment, I enter HTTP by hand using nc:

$ nc www.adorocinema.com 80
GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
Host: www.adorocinema.com
<blank line>

It still truncates, so at least Network.HTTP is not alone. I also try 
elinks. Other people try curl and wget for the same reason and the same 
result.

The second step is to confirm or deny javascript magic. Actually the 
truncation strongly suggests that javascript is not involved: the 
truncation ends with an incomplete end-tag " save.gz
GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
Host: www.adorocinema.com
Accept-Encoding: gzip
<blank line>
<wait a while>

Now save.gz contains both header and body, and it only makes sense to 
uncompress the body. So edit save.gz to delete the header part.

Applying gunzip to the body will give some "unexpected end of file" 
error. Don't despair. Do this instead:

$ zcat save.gz > save.html

It is still an error but save.html has meaningful and complete content. 
You can examine it. You can load it in a web browser and see. At least, 
it is much longer, and it ends with "</html>" rather than "

Re: [Haskell-cafe] Downloading web page in Haskell

Albert Y. C. Lai