Decompressing and http-enumerator

Hi all,

Erik just opened an issue on Github[1] that affected me very recently as well when writing some automated Hackage checking code[2]. The issue is that http-enumerator sees the content-encoding header and decompresses the tarball, returning an uncompressed tarfile. I can avoid this with rawBody = False, but that's not a real solution, since that also disables chunked response handling.

So one possible solution is to just add an option to never decompress response bodies, but that's a bit of a hack. The real question is: what's the correct way to handle these tarballs? Web browsers seem to know not to decompress them, but AFAICT http-enumerator is correctly following the spec by decompressing. Another possibility is to determine whether or not to decompress based on the content type, using either a white or black list (e.g., never decompress TAR files, always decompress HTML files, etc.). I'm open to suggestions here.

Michael

[1] https://github.com/snoyberg/http-enumerator/issues/30
[2] https://github.com/yesodweb/maintenance/blob/master/release/sdist-check.hs

On Mon, Aug 29, 2011 at 04:08, Michael Snoyman
So one possible solution is to just add an option to never decompress response bodies, but that's a bit of a hack. The real question is: what's the correct way to handle these tarballs? Web browsers seem to know not to decompress them, but AFAICT http-enumerator is correctly following the spec by decompressing. Another possibility is to
"Seem to" is pretty much correct; it took years for some browsers to reliably handle them correctly. (Anyone else remember Mozilla saving compressed tarballs uncompressed?) -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

Brandon Allbery wrote:
On Mon, Aug 29, 2011 at 04:08, Michael Snoyman
wrote: So one possible solution is to just add an option to never decompress response bodies, but that's a bit of a hack. The real question is: what's the correct way to handle these tarballs? Web browsers seem to know not to decompress them, but AFAICT http-enumerator is correctly following the spec by decompressing. Another possibility is to
"Seem to" is pretty much correct; it took years for some browsers to reliably handle them correctly. (Anyone else remember Mozilla saving compressed tarballs uncompressed?)
Yes, it was a pain in the neck. Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/

On Mon, Aug 29, 2011 at 10:08 AM, Michael Snoyman
Hi all,
Erik just opened an issue on Github[1] that affected me very recently as well when writing some automated Hackage checking code[2]. The issue is that http-enumerator sees the content-encoding header and decompresses the tarball, returning an uncompressed tarfile. I can avoid this with rawBody = False, but that's not a real solution, since that also disables chunked response handling.
A web server should not be setting "Content-encoding: gzip" on a
.tar.gz file. I agree that http-enumerator is correctly following the
spec by decompressing.
If you decide to implement a workaround for this, the only reasonable thing I can think of is adding an "ignoreContentEncoding" knob the user can twiddle to violate spec.
G
--
Gregory Collins

On Mon, Aug 29, 2011 at 2:21 PM, Gregory Collins
On Mon, Aug 29, 2011 at 10:08 AM, Michael Snoyman
wrote: Hi all,
Erik just opened an issue on Github[1] that affected me very recently as well when writing some automated Hackage checking code[2]. The issue is that http-enumerator sees the content-encoding header and decompresses the tarball, returning an uncompressed tarfile. I can avoid this with rawBody = False, but that's not a real solution, since that also disables chunked response handling.
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file. I agree that http-enumerator is correctly following the spec by decompressing.
If you decide to implement a workaround for this, the only reasonable thing I can think of is adding an "ignoreContentEncoding" knob the user can twiddle to violate spec.
I'm wondering what the most appropriate way to handle this is. Maybe a dontDecompress record, looking like:

    type ContentType = ByteString
    dontDecompress :: ContentType -> Bool

Then browser behavior would be:

    browserDecompress = (== "application/x-tar")

and current behavior would be:

    defaultDecompress = const False

I don't have any strong opinions here...

Michael
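To make the proposal concrete, here is a minimal compilable sketch of the predicate Michael describes. The names (ContentType, dontDecompress-style predicates) are taken from this thread's proposal, not from a released http-enumerator API:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: these predicates mirror the proposal in the thread.
import Data.ByteString (ByteString)

type ContentType = ByteString

-- True means "do NOT decompress this content type".
-- Browser-like behavior: leave tarballs alone.
browserDecompress :: ContentType -> Bool
browserDecompress = (== "application/x-tar")

-- Current http-enumerator behavior: always decompress.
defaultDecompress :: ContentType -> Bool
defaultDecompress = const False

main :: IO ()
main = do
  print (browserDecompress "application/x-tar") -- True: skip decompression
  print (browserDecompress "text/html")         -- False: decompress as usual
```

The predicate takes the response's content type and answers whether decompression should be suppressed for that body.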

On Aug 29, 2011 9:39 PM, "Michael Snoyman"
On Mon, Aug 29, 2011 at 2:21 PM, Gregory Collins
wrote: On Mon, Aug 29, 2011 at 10:08 AM, Michael Snoyman
wrote:
Hi all,
Erik just opened an issue on Github[1] that affected me very recently as well when writing some automated Hackage checking code[2]. The issue is that http-enumerator sees the content-encoding header and decompresses the tarball, returning an uncompressed tarfile. I can avoid this with rawBody = False, but that's not a real solution, since that also disables chunked response handling.
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file. I agree that http-enumerator is correctly following the spec by decompressing.
If you decide to implement a workaround for this, the only reasonable thing I can think of is adding an "ignoreContentEncoding" knob the user can twiddle to violate spec.
I'm wondering what the most appropriate way to handle this is. Maybe a dontDecompress record, looking like:
type ContentType = ByteString
dontDecompress :: ContentType -> Bool
Then browser behavior would be:
browserDecompress = (== "application/x-tar")
and current behavior would be:
defaultDecompress = const False
I don't have any strong opinions here...
I agree with Gregory's suggestion of an API that allows an application to see the data prior to decoding the Content-Encoding. It could be tagged with the name of the content-coding, and there could be a generic decode function (i.e., the library already knows what needs to be done to decode, so there's no need for the application to go looking up the decode function by name).

Conrad.

Michael Snoyman wrote:
I'm wondering what the most appropriate way to handle this is.
Just to get my thoughts in order I'll back track a little.

In the HTTP response, we have two header fields, content-type and content-encoding. For the latter (which may be absent) we can have encodings of gzip or chunked (possibly others). Some examples:

    content-type    content-encoding    current practice
    =====================================================
    text/html       gzip                gunzip it in H.E.
    text/html       chunked             unchunk it in H.E.

For the case where H.E. might be used as part of a HTTP proxy we also have a rawBody option that disables both the unchunking and the gunzipping. This rawBody functionality works now; I'm using it.

We now add to the above a file type where the content-type is application/x-tar and the content-encoding is gzip, but where, from the filename part of the URL, a user may well expect to get a tar.gz file; H.E. currently gunzips it on the fly.

So, on to your suggestion:
Maybe a dontDecompress record, looking like:
type ContentType = ByteString
dontDecompress :: ContentType -> Bool
Then browser behavior would be:
browserDecompress = (== "application/x-tar")
and current behavior would be:
defaultDecompress = const False
I think we should invert the logic of this (to avoid double negatives) so we have:

    type ContentType = ByteString
    decompress :: ContentType -> Bool

    browserDecompress = (/= "application/x-tar")
    defaultDecompress = const True

Was the idea that this decompress field then gets added to the Request record? If so, would simpleHttp be modified to be:

    simpleHttp :: String -> (ContentType -> Bool) -> m L.ByteString

and exporting both browserDecompress and defaultDecompress so they can be used as two sane defaults for the second parameter?

Cheers,
Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
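Erik's inverted version, spelled out as a compilable sketch (again, the names follow this thread's proposal rather than a shipped API):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Inverted logic: True now means "DO decompress this content type".
import Data.ByteString (ByteString)

type ContentType = ByteString

-- Decompress everything except tarballs (browser-like behavior).
browserDecompress :: ContentType -> Bool
browserDecompress = (/= "application/x-tar")

-- Decompress everything (matches current http-enumerator behavior).
defaultDecompress :: ContentType -> Bool
defaultDecompress = const True

main :: IO ()
main = do
  print (browserDecompress "application/x-tar") -- False: left compressed
  print (browserDecompress "text/html")         -- True: gunzipped as usual
```

With the inversion, the common case (decompress) reads positively and the tarball exception is the only negative.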

On Tue, Aug 30, 2011 at 6:27 AM, Erik de Castro Lopo
Michael Snoyman wrote:
I'm wondering what the most appropriate way to handle this is.
Just to get my thoughts in order I'll back track a little.
In the HTTP response, we have two header fields, content-type and content-encoding. For the latter (which may be absent) we can have encodings of gzip or chunked (possibly others).
Actually, chunked would go under transfer-encoding, but I think that's irrelevant for the rest of this discussion.
Some examples:
    content-type    content-encoding    current practice
    =====================================================
    text/html       gzip                gunzip it in H.E.
    text/html       chunked             unchunk it in H.E.
For the case where H.E might be used as part of a HTTP proxy we also have a rawBody option that disables both the unchunking and the gunzipping. This rawBody functionality works now; I'm using it.
We now add to the above a file type where the content-type is application/x-tar and the content-encoding is gzipped but from the filename part of the URL, a user may well expect that we get a tar.gz file but where H.E. currently gunzips it on the fly.
So, on to your suggestion:
Maybe a dontDecompress record, looking like:
type ContentType = ByteString
dontDecompress :: ContentType -> Bool
Then browser behavior would be:
browserDecompress = (== "application/x-tar")
and current behavior would be:
defaultDecompress = const False
I think we should invert the logic of this (to avoid double negatives) so we have:
type ContentType = ByteString
decompress :: ContentType -> Bool

browserDecompress = (/= "application/x-tar")
defaultDecompress = const True
No objections from me.
Was the idea that this decompress field then gets added to the Request record?
Yes.
If so, would simpleHttp be modified to be:
simpleHttp :: String -> (ContentType -> Bool) -> m L.ByteString
and exporting both browserDecompress and defaultDecompress so they can be used as two sane defaults for the second parameter?
I don't want to go this route actually. I think simpleHttp should have the exact same type signature it has right now (thus living up to the name "simple"). It likely makes sense to use browserDecompress as the default for simpleHttp, and defaultDecompress as the default for parseUrl. Though I don't really have a strong opinion on this either.

In either case, I'm thinking we should rename defaultDecompress to alwaysDecompress (my mistake to start off with), to properly indicate what it does.

Michael
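One way the defaults discussed here could wire together, as a hedged sketch. The Request record and parseUrl below are illustrative stand-ins, not the actual http-enumerator definitions:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hypothetical sketch: a decompress predicate stored on the Request
-- record, with parseUrl defaulting to spec-following behavior. These
-- types are simplified stand-ins for the real library's.
import Data.ByteString (ByteString)

type ContentType = ByteString

browserDecompress, alwaysDecompress :: ContentType -> Bool
browserDecompress = (/= "application/x-tar") -- browser-like exception
alwaysDecompress  = const True               -- follow the spec everywhere

data Request = Request
  { requestUrl :: String
  , decompress :: ContentType -> Bool -- the proposed new field
  }

-- parseUrl keeps the strict, spec-following default; simpleHttp (whose
-- type stays unchanged) could swap in browserDecompress internally.
parseUrl :: String -> Request
parseUrl url = Request url alwaysDecompress

main :: IO ()
main =
  print (decompress (parseUrl "http://example.com/foo.tar.gz")
                    "application/x-tar") -- True under the parseUrl default
```

This keeps simpleHttp's signature "simple" while still letting callers who build a Request themselves override the predicate.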

Michael Snoyman wrote:
I think we should invert the logic of this (to avoid double negatives) so we have:
type ContentType = ByteString
decompress :: ContentType -> Bool

browserDecompress = (/= "application/x-tar")
defaultDecompress = const True
No objections from me.
Was the idea that this decompress field then gets added to the Request record?
Yes.
If so, would simpleHttp be modified to be:
simpleHttp :: String -> (ContentType -> Bool) -> m L.ByteString
and exporting both browserDecompress and defaultDecompress so they can be used as two sane defaults for the second parameter?
I don't want to go this route actually. I think simpleHttp should have the exact same type signature it has right now (thus living up to the name "simple"). It likely makes sense to use browserDecompress as the default for simpleHttp, and defaultDecompress as the default for parseUrl. Though I don't really have a strong opinion on this either. In either case, I'm thinking we should rename defaultDecompress to alwaysDecompress (my mistake to start off with), to properly indicate what it does.
Ok, I'll prepare a patch along these lines and submit a github pull request. Cheers, Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/

On Mon, 2011-08-29 at 13:21 +0200, Gregory Collins wrote:
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file.
Why not? From RFC2616 compliant servers I'd expect a .tar.gz file to have the Content-* headers provide meta-information about the content[1], e.g.

    Content-Type: application/x-tar
    Content-Encoding: gzip
    Transfer-Encoding: chunked

If I want to detach the gzip encoding from the "content" (or "entity"), I'd move it to the Transfer-Encoding header[2], e.g.:

    Content-Type: application/x-tar
    Transfer-Encoding: gzip, chunked

[1]: See RFC2616 sec7.2.1: "Content-Type specifies the media type of the underlying data. Content-Encoding may be used to indicate any additional content codings applied to the data, usually for the purpose of data compression, that are a property of the requested resource."

[2]: See RFC2616 sec4.3: "Transfer-Encoding is a property of the message, not of the entity, and thus MAY be added or removed by any application along the request/response chain."

On Mon, Aug 29, 2011 at 5:28 PM, Herbert Valerio Riedel
On Mon, 2011-08-29 at 13:21 +0200, Gregory Collins wrote:
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file.
Why not? From RFC2616 compliant servers I'd expect a .tar.gz file to have the Content-* headers provide meta-information about the content[1], e.g.
    Content-Type: application/x-tar
    Content-Encoding: gzip
    Transfer-Encoding: chunked
If I want to detach the gzip encoding from the "content" (or "entity"), I'd move it to the Transfer-Encoding header[2], e.g.:
    Content-Type: application/x-tar
    Transfer-Encoding: gzip, chunked
[1]: See RFC2616 sec7.2.1: "Content-Type specifies the media type of the underlying data. Content-Encoding may be used to indicate any additional content codings applied to the data, usually for the purpose of data compression, that are a property of the requested resource."
[2]: See RFC2616 sec4.3: "Transfer-Encoding is a property of the message, not of the entity, and thus MAY be added or removed by any application along the request/response chain."
"chunked" is the only valid transfer-encoding[1], while gzip must be specified on the content-encoding header[2]. For a simple example of these two, look at the response headers from Haskellers[3] in something like Chrome developer tools.

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6
[2] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5
[3] http://www.haskellers.com/

On Mon, Aug 29, 2011 at 4:28 PM, Herbert Valerio Riedel
On Mon, 2011-08-29 at 13:21 +0200, Gregory Collins wrote:
A web server should not be setting "Content-encoding: gzip" on a .tar.gz file.
Why not? From RFC2616 compliant servers I'd expect a .tar.gz file to have the Content-* headers provide meta-information about the content[1], e.g.
    Content-Type: application/x-tar
    Content-Encoding: gzip
    Transfer-Encoding: chunked
The way I would interpret this is: this MIME body is a TAR file which has been gzip-encoded for the purpose of efficiency in transmission. When I ask the library for the body contents as an octet stream, I would expect to get the TAR file contents, uncompressed. This is how it works when you send text/html with "Content-Encoding: gzip"; I don't understand why it should be different with a .tar.gz file.

If you wanted the MIME body to be passed through unmolested (i.e. you expect the octet stream to actually be in gzip-compressed TAR format), I would expect that you set "Content-Type: application/x-tgz" without a Content-Encoding. But that's just my interpretation both of standard practice and of the spec.
If I want to detach the gzip encoding from the "content" (or "entity"), I'd move it to the Transfer-Encoding header[2], e.g.:
    Content-Type: application/x-tar
    Transfer-Encoding: gzip, chunked
As Michael mentioned, that isn't how those headers are interpreted.
G
--
Gregory Collins
participants (6)
-
Brandon Allbery
-
Conrad Parker
-
Erik de Castro Lopo
-
Gregory Collins
-
Herbert Valerio Riedel
-
Michael Snoyman