
On Thu, Mar 27, 2008 at 04:39:17PM +0000, Duncan Coutts wrote:
> Can't we just reject them with the error message and ask people to fix the Latin-1 sequences and re-upload using proper UTF-8?
The problem is that there are packages there now with .cabal files assuming Latin-1. Stopping more of them from getting in is fine, but we need to display the ones that are there correctly. Hmm, after considering a few schemes it's probably simplest to introduce strict enforcement on upload and retroactively patch the existing Latin-1 packages to UTF-8. Naughty, but a one-off.
> You suggested previously that we should add a warning for the cases where an isolated Latin-1 char in someone's name turns out to be valid UTF-8 (but encoding for an unexpected char). I think that's a good idea. Obviously that'd want to be a non-fatal warning. Hmm, I now can't find the note where you made that suggestion. Can you give more details on how that check would work exactly?
The common case is ASCII char, non-ASCII char, ASCII char. That's not a valid UTF-8 sequence (a lone byte above 0x7F is either a stray continuation byte or a lead byte with no continuation after it), but fromUTF is erroneously accepting it. It needs to be tightened up to keep these errors out. Incidentally, a UTF-8 decoder is also supposed to reject non-minimal encodings, e.g. a 3-byte encoding for a Char that can be encoded in 2 bytes. That's to force canonical encodings for security.
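
To make the two rules above concrete, here is a rough sketch of a strict decoder. It is not the actual fromUTF code; the names decodeUtf8Strict, isCont and addBits are made up for illustration, and working on a plain [Word8] is an assumption chosen to keep the example short. It rejects stray or missing continuation bytes (which catches the ASCII, Latin-1, ASCII case) and non-minimal encodings, plus surrogates and values past U+10FFFF.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical strict decoder: Nothing on any malformed input.
decodeUtf8Strict :: [Word8] -> Maybe String
decodeUtf8Strict [] = Just []
decodeUtf8Strict (b : bs)
  | b < 0x80  = (chr (fromIntegral b) :) <$> decodeUtf8Strict bs
  | b < 0xC0  = Nothing                        -- stray continuation byte
  | b < 0xE0  = multi 1 0x80    (b .&. 0x1F)   -- 2-byte sequence
  | b < 0xF0  = multi 2 0x800   (b .&. 0x0F)   -- 3-byte sequence
  | b < 0xF8  = multi 3 0x10000 (b .&. 0x07)   -- 4-byte sequence
  | otherwise = Nothing                        -- 0xF8..0xFF never valid
  where
    -- n continuation bytes expected; minCode is the smallest code point
    -- that genuinely needs a sequence of this length.
    multi :: Int -> Int -> Word8 -> Maybe String
    multi n minCode lead
      | length conts /= n || not (all isCont conts) = Nothing
      | code < minCode                   = Nothing  -- non-minimal encoding
      | code > 0x10FFFF                  = Nothing  -- outside Unicode
      | code >= 0xD800 && code <= 0xDFFF = Nothing  -- surrogate
      | otherwise = (chr code :) <$> decodeUtf8Strict rest
      where
        (conts, rest) = splitAt n bs
        code = foldl addBits (fromIntegral lead) conts

-- A continuation byte has the bit pattern 10xxxxxx.
isCont :: Word8 -> Bool
isCont c = c .&. 0xC0 == 0x80

-- Append the low six bits of a continuation byte to the accumulator.
addBits :: Int -> Word8 -> Int
addBits acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
```

With this, the Latin-1 bytes [0x41, 0xE9, 0x41] are rejected because 0xE9 announces a 3-byte sequence that never arrives, while the proper UTF-8 form [0x41, 0xC3, 0xA9, 0x41] decodes. The 3-byte form [0xE0, 0x83, 0xA9] of the same character is rejected as non-minimal.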