On Wed, Apr 2, 2014 at 9:08 AM, Kazu Yamamoto <kazu@iij.ad.jp> wrote:
Hi Michael,

Thank you for your reply.

> I suppose theoretically you could be talking about a situation where Mighty
> is hosting a CGI application that receives user data and produces a static
> HTML file as a result.

Yes. Also I'm thinking about Yesod.


Yesod focuses more on dynamic content, and in those cases we *do* already set charset=utf-8[1]. Where this would affect Yesod is yesod-static, and there the same logic I applied to Mighty holds: users should not be able to affect the content of static files under normal circumstances, so the security concern is pretty remote.

[1] https://github.com/yesodweb/yesod/blob/master/yesod-core/Yesod/Core/Content.hs#L161
 
> But it
> could be worked around by the CGI application using <meta charset=...>
> instead.

Yes. Is this rarely used in Yesod?


Yes. Dynamic responses don't normally go via static file serving at all. In WAI terms, we always end up with a ResponseBuilder, not a ResponseFile, for dynamic content.
 
> So that comes to the question: is it safe for Mighty, mime-types, etc, to
> require that all HTML files are stored as UTF-8? I'd say, as long as
> there's a way for a user to override that if necessary, it sounds good to
> me. mime-types does provide such a capability, so I'd be in favor of
> tweaking its textual types to include explicit charset information.

Probably I was too sensitive. Based on your discussion, it is
safer/better for Mighty not to hard-code charset.
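
To make the "default charset plus user override" idea concrete, here is a minimal sketch using only base. All names in it (defaultTable, lookupMime, and so on) are hypothetical; this is not the actual mime-types API, just an illustration of overrides taking precedence over defaults that carry an explicit charset.

```haskell
import Data.Maybe (fromMaybe)

-- Hypothetical types and names for illustration; not the mime-types API.
type Extension = String
type MimeType = String

-- Default table: textual types carry an explicit charset.
defaultTable :: [(Extension, MimeType)]
defaultTable =
  [ ("html", "text/html; charset=utf-8")
  , ("txt", "text/plain; charset=utf-8")
  , ("png", "image/png")
  ]

-- Used when the extension is unknown.
fallbackType :: MimeType
fallbackType = "application/octet-stream"

-- 'lookup' returns the first match, so user overrides placed in front
-- of the defaults win.
lookupMime :: [(Extension, MimeType)] -> Extension -> MimeType
lookupMime overrides ext =
  fromMaybe fallbackType (lookup ext (overrides ++ defaultTable))
```

A site that stores its HTML in another encoding would pass something like [("html", "text/html; charset=shift_jis")] as the overrides, and everyone else gets UTF-8 without doing anything.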


To be clear, besides the security concerns, there is *definitely* a usability advantage in specifying charsets explicitly: the browser doesn't have to fall back on defaults or guess[2]. This just comes down to a numbers game: is it more likely that a browser will mis-guess the character encoding of UTF-8 data, or that someone running Mighty will provide non-UTF-8 data?

One other point in favor of specifying the encoding is that a file stored in the wrong encoding will fail *reliably*. Without a charset, some browsers may guess the wrong character encoding while others won't, which makes problems difficult to debug. If you *always* serve with charset=utf-8 and that turns out to be wrong, you'll find out quickly and reliably.
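
As a rough illustration of why the failure is reliable: text in a legacy single-byte encoding usually isn't structurally valid UTF-8, so anything told to decode it as UTF-8 hits the mismatch deterministically instead of guessing. The sketch below is a simplified checker written for this email (validUtf8 is a hypothetical name, and it deliberately ignores overlong forms and surrogates), not what any browser actually runs.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- A continuation byte has the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Simplified structural check: each lead byte must be followed by the
-- right number of continuation bytes. Overlong encodings and surrogates
-- are NOT rejected here.
validUtf8 :: [Word8] -> Bool
validUtf8 [] = True
validUtf8 (b : rest)
  | b < 0x80 = validUtf8 rest          -- ASCII
  | b .&. 0xE0 == 0xC0 = continue 1    -- 2-byte sequence
  | b .&. 0xF0 == 0xE0 = continue 2    -- 3-byte sequence
  | b .&. 0xF8 == 0xF0 = continue 3    -- 4-byte sequence
  | otherwise = False                  -- stray continuation or invalid lead
  where
    continue n =
      let (cs, rest') = splitAt n rest
      in length cs == n && all isContinuation cs && validUtf8 rest'

-- "café" encoded as Latin-1 and as UTF-8.
cafeLatin1, cafeUtf8 :: [Word8]
cafeLatin1 = [0x63, 0x61, 0x66, 0xE9]
cafeUtf8 = [0x63, 0x61, 0x66, 0xC3, 0xA9]
```

Here validUtf8 cafeUtf8 is True and validUtf8 cafeLatin1 is False, since the lone 0xE9 looks like the start of a three-byte sequence that never arrives.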

[2] http://en.wikipedia.org/wiki/Charset_detection