Decompressing and http-enumerator

Hi all,

Erik just opened an issue on Github[1] that affected me very recently as well when writing some automated Hackage checking code[2]. The issue is that http-enumerator sees the content-encoding header and decompresses the tarball, returning an uncompressed tarfile. I can avoid this with rawBody = False, but that's not a real solution, since that also disables chunked response handling.

So one possible solution is to just add an option to never decompress response bodies, but that's a bit of a hack. The real question is: what's the correct way to handle these tarballs? Web browsers seem to know not to decompress them, but AFAICT http-enumerator is correctly following the spec by decompressing. Another possibility is to determine whether or not to decompress based on the content type, using either a white or black list (e.g., never decompress TAR files, always decompress HTML files, etc.). I'm open to suggestions here.

Michael

[1] https://github.com/snoyberg/http-enumerator/issues/30
[2] https://github.com/yesodweb/maintenance/blob/master/release/sdist-check.hs

On Mon, Aug 29, 2011 at 04:08, Michael Snoyman
So one possible solution is to just add an option to never decompress response bodies, but that's a bit of a hack. The real question is: what's the correct way to handle these tarballs? Web browsers seem to know not to decompress them, but AFAICT http-enumerator is correctly following the spec by decompressing. Another possibility is to
"Seem to" is pretty much correct; it took years for some browsers to reliably handle them correctly. (Anyone else remember Mozilla saving compressed tarballs uncompressed?) -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

Brandon Allbery wrote:
On Mon, Aug 29, 2011 at 04:08, Michael Snoyman
wrote: So one possible solution is to just add an option to never decompress response bodies, but that's a bit of a hack. The real question is: what's the correct way to handle these tarballs? Web browsers seem to know not to decompress them, but AFAICT http-enumerator is correctly following the spec by decompressing. Another possibility is to
"Seem to" is pretty much correct; it took years for some browsers to reliably handle them correctly. (Anyone else remember Mozilla saving compressed tarballs uncompressed?)
Yes, it was a pain in the neck. Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/

On Mon, Aug 29, 2011 at 10:08 AM, Michael Snoyman
Hi all,
Erik just opened an issue on Github[1] that affected me very recently as well when writing some automated Hackage checking code[2]. The issue is that http-enumerator sees the content-encoding header and decompresses the tarball, returning an uncompressed tarfile. I can avoid this with rawBody = False, but that's not a real solution, since that also disables chunked response handling.
A web server should not be setting "Content-encoding: gzip" on a
.tar.gz file. I agree that http-enumerator is correctly following the
spec by decompressing.
If you decide to implement a workaround for this, the only reasonable thing I can think of is adding an "ignoreContentEncoding" knob the user can twiddle to violate spec.
G
--
Gregory Collins

On Mon, Aug 29, 2011 at 2:21 PM, Gregory Collins
On Mon, Aug 29, 2011 at 10:08 AM, Michael Snoyman
wrote: Hi all,
Erik just opened an issue on Github[1] that affected me very recently as well when writing some automated Hackage checking code[2]. The issue is that http-enumerator sees the content-encoding header and decompresses the tarball, returning an uncompressed tarfile. I can avoid this with rawBody = False, but that's not a real solution, since that also disables chunked response handling.
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file. I agree that http-enumerator is correctly following the spec by decompressing.
If you decide to implement a workaround for this, the only reasonable thing I can think of is adding an "ignoreContentEncoding" knob the user can twiddle to violate spec.
I'm wondering what the most appropriate way to handle this is. Maybe a dontDecompress record, looking like:

    type ContentType = ByteString
    dontDecompress :: ContentType -> Bool

Then browser behavior would be:

    browserDecompress = (== "application/x-tar")

and current behavior would be:

    defaultDecompress = const False

I don't have any strong opinions here...

Michael
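To make the proposal concrete, here is a minimal compilable sketch of the predicate Michael describes. The names (ContentType, dontDecompress-style predicates) are taken from this thread's proposal, not from a released http-enumerator API:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: these predicates mirror the proposal in the thread.
import Data.ByteString (ByteString)

type ContentType = ByteString

-- True means "do NOT decompress this content type".
-- Browser-like behavior: leave tarballs alone.
browserDecompress :: ContentType -> Bool
browserDecompress = (== "application/x-tar")

-- Current http-enumerator behavior: always decompress.
defaultDecompress :: ContentType -> Bool
defaultDecompress = const False

main :: IO ()
main = do
  print (browserDecompress "application/x-tar") -- True: skip decompression
  print (browserDecompress "text/html")         -- False: decompress as usual
```

The predicate takes the response's content type and answers whether decompression should be suppressed for that body.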

On Aug 29, 2011 9:39 PM, "Michael Snoyman"
On Mon, Aug 29, 2011 at 2:21 PM, Gregory Collins
wrote: On Mon, Aug 29, 2011 at 10:08 AM, Michael Snoyman
wrote:
Hi all,
Erik just opened an issue on Github[1] that affected me very recently as well when writing some automated Hackage checking code[2]. The issue is that http-enumerator sees the content-encoding header and decompresses the tarball, returning an uncompressed tarfile. I can avoid this with rawBody = False, but that's not a real solution, since that also disables chunked response handling.
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file. I agree that http-enumerator is correctly following the spec by decompressing.
If you decide to implement a workaround for this, the only reasonable thing I can think of is adding an "ignoreContentEncoding" knob the user can twiddle to violate spec.
I'm wondering what the most appropriate way to handle this is. Maybe a dontDecompress record, looking like:
type ContentType = ByteString
dontDecompress :: ContentType -> Bool
Then browser behavior would be:
browserDecompress = (== "application/x-tar")
and current behavior would be:
defaultDecompress = const False
I don't have any strong opinions here...
I agree with Gregory's suggestion of an API that allows an application to see the data prior to decoding the Content-Encoding. It could be tagged with the name of the content-coding, and there could be a generic decode function (i.e., the library already knows what needs to be done to decode, so there's no need for the application to go looking up the decode function by name).

Conrad.

Michael Snoyman wrote:
I'm wondering what the most appropriate way to handle this is.
Just to get my thoughts in order I'll back track a little.

In the HTTP response, we have two header fields, content-type and content-encoding. For the latter (which may be absent) we can have encodings of gzip or chunked (possibly others). Some examples:

    content-type    content-encoding    current practice
    =====================================================
    text/html       gzip                gunzip it in H.E.
    text/html       chunked             unchunk it in H.E.

For the case where H.E. might be used as part of a HTTP proxy we also have a rawBody option that disables both the unchunking and the gunzipping. This rawBody functionality works now; I'm using it.

We now add to the above a file type where the content-type is application/x-tar and the content-encoding is gzip, but where, from the filename part of the URL, a user may well expect to get a tar.gz file; H.E. currently gunzips it on the fly.

So, on to your suggestion:
Maybe a dontDecompress record, looking like:
type ContentType = ByteString
dontDecompress :: ContentType -> Bool
Then browser behavior would be:
browserDecompress = (== "application/x-tar")
and current behavior would be:
defaultDecompress = const False
I think we should invert the logic of this (to avoid double negatives) so we have:

    type ContentType = ByteString
    decompress :: ContentType -> Bool

    browserDecompress = (/= "application/x-tar")
    defaultDecompress = const True

Was the idea that this decompress field then gets added to the Request record? If so, would simpleHttp be modified to be:

    simpleHttp :: String -> (ContentType -> Bool) -> m L.ByteString

and exporting both browserDecompress and defaultDecompress so they can be used as two sane defaults for the second parameter?

Cheers,
Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
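Erik's inverted version, spelled out as a compilable sketch (again, the names follow this thread's proposal rather than a shipped API):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Inverted logic: True now means "DO decompress this content type".
import Data.ByteString (ByteString)

type ContentType = ByteString

-- Decompress everything except tarballs (browser-like behavior).
browserDecompress :: ContentType -> Bool
browserDecompress = (/= "application/x-tar")

-- Decompress everything (matches current http-enumerator behavior).
defaultDecompress :: ContentType -> Bool
defaultDecompress = const True

main :: IO ()
main = do
  print (browserDecompress "application/x-tar") -- False: left compressed
  print (browserDecompress "text/html")         -- True: gunzipped as usual
```

With the inversion, the common case (decompress) reads positively and the tarball exception is the only negative.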

On Tue, Aug 30, 2011 at 6:27 AM, Erik de Castro Lopo
Michael Snoyman wrote:
I'm wondering what the most appropriate way to handle this is.
Just to get my thoughts in order I'll back track a little.
In the HTTP response, we have two header fields, content-type and content-encoding. For the latter (which may be absent) we can have encodings of gzip or chunked (possibly others).
Actually, chunked would go under transfer-encoding, but I think that's irrelevant for the rest of this discussion.
Some examples:
    content-type    content-encoding    current practice
    =====================================================
    text/html       gzip                gunzip it in H.E.
    text/html       chunked             unchunk it in H.E.
For the case where H.E might be used as part of a HTTP proxy we also have a rawBody option that disables both the unchunking and the gunzipping. This rawBody functionality works now; I'm using it.
We now add to the above a file type where the content-type is application/x-tar and the content-encoding is gzipped but from the filename part of the URL, a user may well expect that we get a tar.gz file but where H.E. currently gunzips it on the fly.
So, on to your suggestion:
Maybe a dontDecompress record, looking like:
type ContentType = ByteString
dontDecompress :: ContentType -> Bool
Then browser behavior would be:
browserDecompress = (== "application/x-tar")
and current behavior would be:
defaultDecompress = const False
I think we should invert the logic of this (to avoid double negatives) so we have:
type ContentType = ByteString
decompress :: ContentType -> Bool

browserDecompress = (/= "application/x-tar")
defaultDecompress = const True
No objections from me.
Was the idea that this decompress field then gets added to the Request record?
Yes.
If so, would simpleHttp be modified to be:
simpleHttp :: String -> (ContentType -> Bool) -> m L.ByteString
and exporting both browserDecompress and defaultDecompress so they can be used as two sane defaults for the second parameter?
I don't want to go this route actually. I think simpleHttp should have the exact same type signature it has right now (thus living up to the name "simple"). It likely makes sense to use browserDecompress as the default for simpleHttp, and defaultDecompress as the default for parseUrl. Though I don't really have a strong opinion on this either.

In either case, I'm thinking we should rename defaultDecompress to alwaysDecompress (my mistake to start off with), to properly indicate what it does.

Michael
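One way the defaults discussed here could wire together, as a hedged sketch. The Request record and parseUrl below are illustrative stand-ins, not the actual http-enumerator definitions:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hypothetical sketch: a decompress predicate stored on the Request
-- record, with parseUrl defaulting to spec-following behavior. These
-- types are simplified stand-ins for the real library's.
import Data.ByteString (ByteString)

type ContentType = ByteString

browserDecompress, alwaysDecompress :: ContentType -> Bool
browserDecompress = (/= "application/x-tar") -- browser-like exception
alwaysDecompress  = const True               -- follow the spec everywhere

data Request = Request
  { requestUrl :: String
  , decompress :: ContentType -> Bool -- the proposed new field
  }

-- parseUrl keeps the strict, spec-following default; simpleHttp (whose
-- type stays unchanged) could swap in browserDecompress internally.
parseUrl :: String -> Request
parseUrl url = Request url alwaysDecompress

main :: IO ()
main =
  print (decompress (parseUrl "http://example.com/foo.tar.gz")
                    "application/x-tar") -- True under the parseUrl default
```

This keeps simpleHttp's signature "simple" while still letting callers who build a Request themselves override the predicate.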

Michael Snoyman wrote:
I think we should invert the logic of this (to avoid double negatives) so we have:
type ContentType = ByteString
decompress :: ContentType -> Bool

browserDecompress = (/= "application/x-tar")
defaultDecompress = const True
No objections from me.
Was the idea that this decompress field then gets added to the Request record?
Yes.
If so, would simpleHttp be modified to be:
simpleHttp :: String -> (ContentType -> Bool) -> m L.ByteString
and exporting both browserDecompress and defaultDecompress so they can be used as two sane defaults for the second parameter?
I don't want to go this route actually. I think simpleHttp should have the exact same type signature it has right now (thus living up to the name "simple"). It likely makes sense to use browserDecompress as the default for simpleHttp, and defaultDecompress as the default for parseUrl. Though I don't really have a strong opinion on this either. In either case, I'm thinking we should rename defaultDecompress to alwaysDecompress (my mistake to start off with), to properly indicate what it does.
Ok, I'll prepare a patch along these lines and submit a github pull request. Cheers, Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/

On Mon, 2011-08-29 at 13:21 +0200, Gregory Collins wrote:
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file.
Why not? From RFC2616 compliant servers I'd expect a .tar.gz file to have the Content-* headers provide meta-information about the content[1], e.g.

    Content-Type: application/x-tar
    Content-Encoding: gzip
    Transfer-Encoding: chunked

If I want to detach the gzip encoding from the "content" (or "entity"), I'd move it to the Transfer-Encoding header[2], e.g.:

    Content-Type: application/x-tar
    Transfer-Encoding: gzip, chunked

[1]: See RFC2616 sec7.2.1: "Content-Type specifies the media type of the underlying data. Content-Encoding may be used to indicate any additional content codings applied to the data, usually for the purpose of data compression, that are a property of the requested resource."

[2]: See RFC2616 sec4.3: "Transfer-Encoding is a property of the message, not of the entity, and thus MAY be added or removed by any application along the request/response chain."

On Mon, Aug 29, 2011 at 5:28 PM, Herbert Valerio Riedel
On Mon, 2011-08-29 at 13:21 +0200, Gregory Collins wrote:
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file.
Why not? From RFC2616 compliant servers I'd expect a .tar.gz file to have the Content-* headers provide meta-information about the content[1], e.g.
    Content-Type: application/x-tar
    Content-Encoding: gzip
    Transfer-Encoding: chunked
If I want to detach the gzip encoding from the "content" (or "entity"), I'd move it to the Transfer-Encoding header[2], e.g.:
    Content-Type: application/x-tar
    Transfer-Encoding: gzip, chunked
[1]: See RFC2616 sec7.2.1: "Content-Type specifies the media type of the underlying data. Content-Encoding may be used to indicate any additional content codings applied to the data, usually for the purpose of data compression, that are a property of the requested resource."
[2]: See RFC2616 sec4.3: "Transfer-Encoding is a property of the message, not of the entity, and thus MAY be added or removed by any application along the request/response chain."
"chunked" is the only valid transfer-encoding[1], while gzip must be specified on the content-encoding header[2]. For a simple example of these two, look at the response headers from Haskellers[3] in something like Chrome developer tools.

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6
[2] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5
[3] http://www.haskellers.com/

On Mon, Aug 29, 2011 at 4:28 PM, Herbert Valerio Riedel
On Mon, 2011-08-29 at 13:21 +0200, Gregory Collins wrote:
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file.
Why not? From RFC2616 compliant servers I'd expect a .tar.gz file to have the Content-* headers provide meta-information about the content[1], e.g.
    Content-Type: application/x-tar
    Content-Encoding: gzip
    Transfer-Encoding: chunked
The way I would interpret this is: this MIME body is a TAR file which has been gzip-encoded for the purpose of efficiency in transmission. When I ask the library for the body contents as an octet stream, I would expect to get the TAR file contents, uncompressed. This is how it works when you send text/html with "Content-Encoding: gzip"; I don't understand why it should be different with a .tar.gz file.

If you wanted the MIME body to be passed through unmolested (i.e. you expect the octet stream to actually be in gzip-compressed TAR format), I would expect that you set "Content-Type: application/x-tgz" without a Content-Encoding. But that's just my interpretation both of standard practice and of the spec.
If I want to detach the gzip encoding from the "content" (or "entity"), I'd move it to the Transfer-Encoding header[2], e.g.:
    Content-Type: application/x-tar
    Transfer-Encoding: gzip, chunked
As Michael mentioned, that isn't how those headers are interpreted.
G
--
Gregory Collins
participants (6)
-
Brandon Allbery
-
Conrad Parker
-
Erik de Castro Lopo
-
Gregory Collins
-
Herbert Valerio Riedel
-
Michael Snoyman