
On Wed, Apr 2, 2014 at 9:08 AM, Kazu Yamamoto
Hi Michael,
Thank you for your reply.
I suppose theoretically you could be talking about a situation where Mighty is hosting a CGI application that receives user data and produces a static HTML file as a result.
Yes. Also I'm thinking about Yesod.
Yesod has more of a focus on dynamic content, and in those cases, we *do* already set charset=utf-8[1]. Where this would affect Yesod is in yesod-static, in which case the same logic I've applied to Mighty would apply: users should not be able to affect the content of static files under normal circumstances, so the security concern is pretty remote.
[1] https://github.com/yesodweb/yesod/blob/master/yesod-core/Yesod/Core/Content....
But it could be worked around by the CGI application using <meta charset=...> instead.
Yes. Is this rarely used in Yesod?
Yes. Dynamic responses don't normally go via static file serving at all. In WAI terms, we always end up with a ResponseBuilder, not a ResponseFile, for dynamic content.
So that comes to the question: is it safe for Mighty, mime-types, etc., to require that all HTML files are stored as UTF-8? I'd say, as long as there's a way for a user to override that if necessary, it sounds good to me. mime-types does provide such a capability, so I'd be in favor of tweaking its textual types to include explicit charset information.
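A minimal sketch of that idea, using only base: textual types get an explicit charset by default, and a user-supplied table takes precedence when a file is not UTF-8. All names here (`defaultTypes`, `withUtf8`, `lookupMime`) are hypothetical and this is not the mime-types API; it only illustrates the default-plus-override shape.

```haskell
import Control.Applicative ((<|>))
import Data.List (isInfixOf, isPrefixOf)
import Data.Maybe (fromMaybe)

-- Hypothetical names throughout; this is not the mime-types API.
type Extension = String
type MimeType  = String

-- | A small illustrative default table (assumed, not the real one).
defaultTypes :: [(Extension, MimeType)]
defaultTypes =
  [ ("html", "text/html")
  , ("css",  "text/css")
  , ("txt",  "text/plain")
  , ("png",  "image/png")
  ]

-- | Append an explicit UTF-8 charset to textual types that lack one.
withUtf8 :: MimeType -> MimeType
withUtf8 mt
  | "text/" `isPrefixOf` mt && not ("charset=" `isInfixOf` mt) =
      mt ++ "; charset=utf-8"
  | otherwise = mt

-- | Look up an extension. User overrides win and are returned untouched;
-- otherwise the default entry is used, with a charset added if textual.
lookupMime :: [(Extension, MimeType)] -> Extension -> MimeType
lookupMime overrides ext =
  fromMaybe "application/octet-stream" $
    lookup ext overrides <|> (withUtf8 <$> lookup ext defaultTypes)
```

With no overrides, `lookupMime [] "html"` gives a `text/html` type carrying `charset=utf-8`, while someone serving legacy files could pass `[("html", "text/html; charset=iso-8859-1")]` to opt out.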
Probably I was too sensitive. Based on your discussion, it is safer/better for Mighty not to hard-code a charset.
To be clear, besides the security concerns, there is *definitely* a usability advantage in specifying charsets explicitly, in that the browser doesn't need to use defaults or guessing[2]. This just comes down to a numbers game: is it more likely that a browser will mis-guess the character encoding of UTF-8 data, or that someone running Mighty will provide non-UTF-8 data? One other point in favor of specifying the encoding is that a file served with the wrong encoding will fail *reliably*. Without a charset, some browsers may guess the wrong character encoding while others won't, which makes it difficult to debug. If you *always* serve with charset=utf-8 and that turns out to be wrong, you'll find out quickly and reliably.
[2] http://en.wikipedia.org/wiki/Charset_detection
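To illustrate why the failure is reliable: bytes in a legacy single-byte encoding are usually not well-formed UTF-8, so a consumer told to expect UTF-8 can detect the mismatch deterministically instead of guessing. The sketch below is a simplified structural check written for this message (`isUtf8` is a hypothetical helper, not taken from any library), and it does not reject every overlong form or surrogate code points.

```haskell
import Data.Word (Word8)

-- | Simplified well-formedness check for UTF-8: verifies that each lead
-- byte is followed by the right number of continuation bytes.
isUtf8 :: [Word8] -> Bool
isUtf8 [] = True
isUtf8 (b:bs)
  | b < 0x80              = isUtf8 bs      -- ASCII
  | b >= 0xC2 && b < 0xE0 = continue 1 bs  -- 2-byte sequence
  | b >= 0xE0 && b < 0xF0 = continue 2 bs  -- 3-byte sequence
  | b >= 0xF0 && b < 0xF5 = continue 3 bs  -- 4-byte sequence
  | otherwise             = False          -- stray continuation or invalid lead
  where
    continue n rest =
      let (cs, rest') = splitAt n rest
      in length cs == n && all isCont cs && isUtf8 rest'
    isCont c = c >= 0x80 && c < 0xC0

-- "café" encoded as UTF-8 and as ISO-8859-1, respectively.
cafeUtf8, cafeLatin1 :: [Word8]
cafeUtf8   = [0x63, 0x61, 0x66, 0xC3, 0xA9]
cafeLatin1 = [0x63, 0x61, 0x66, 0xE9]
```

Here `isUtf8 cafeUtf8` holds and `isUtf8 cafeLatin1` does not, since the lone 0xE9 byte announces a three-byte sequence that never arrives.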