patch applied (cabal): Make UTF-8 decoding errors in .cabal files non-fatal

Wed Mar 26 20:17:40 PDT 2008 Duncan Coutts

On Thu, Mar 27, 2008 at 08:45:29AM -0700, Duncan Coutts wrote:
Wed Mar 26 20:17:40 PDT 2008 Duncan Coutts
* Make UTF-8 decoding errors in .cabal files non-fatal Previously we checked for invalid UTF-8 in the first phase of the parser, which splitting the file up into nested sections and fields. This meant the whole parser falls over if there is invalid UTF-8 anywhere in the file. Sadly there are already packages on hackage with invalid UTF-8 so we would fail when parsing the hackage index. The solution is to move the check into the parsing of the individual fields and making it a warning not an error. We most typically get invalid UTF-8 in free text fields like author name, copyright, description etc so this should work out ok usually. We now get pretty decent error messages, like: Warning: hsx.cabal:5: Invalid UTF-8 text in the 'author' field. The warning type is now structured so that hackage will be able to distinguish general non-fatal warnings from UTF-8 decoding problems which really should be fatal errors for package uploads.
These invalid UTF-8 strings are usually valid Latin-1 in people's names, which the web interface needs to show. So would it be possible give the warning, but either to treat bytes that comprise an encoding error as Latin-1 Chars, or to reparse a string (or file) with UTF errors as a Latin-1 string? In almost all cases, the problematic sequence is a single non-ASCII byte surrounded by ASCII bytes.

In message <20080327160333.GA10636@soi.city.ac.uk> cabal-devel@haskell.org writes:
On Thu, Mar 27, 2008 at 08:45:29AM -0700, Duncan Coutts wrote:
Wed Mar 26 20:17:40 PDT 2008 Duncan Coutts
* Make UTF-8 decoding errors in .cabal files non-fatal Previously we checked for invalid UTF-8 in the first phase of the parser, which splitting the file up into nested sections and fields. This meant the whole parser falls over if there is invalid UTF-8 anywhere in the file. Sadly there are already packages on hackage with invalid UTF-8 so we would fail when parsing the hackage index. The solution is to move the check into the parsing of the individual fields and making it a warning not an error. We most typically get invalid UTF-8 in free text fields like author name, copyright, description etc so this should work out ok usually. We now get pretty decent error messages, like: Warning: hsx.cabal:5: Invalid UTF-8 text in the 'author' field. The warning type is now structured so that hackage will be able to distinguish general non-fatal warnings from UTF-8 decoding problems which really should be fatal errors for package uploads. These invalid UTF-8 strings are usually valid Latin-1 in people's names, which the web interface needs to show.
Can't we just reject them with the error message and ask people to fix the latin-1 sequences and re-upload using proper UTF-8? Is the web interface sending UTF-8 now? I don't know if we've done an end-to-end test yet. If we have then we should close ticket #145: http://hackage.haskell.org/trac/hackage/ticket/145
So would it be possible give the warning, but either to treat bytes that comprise an encoding error as Latin-1 Chars, or to reparse a string (or file) with UTF errors as a Latin-1 string? In almost all cases, the problematic sequence is a single non-ASCII byte surrounded by ASCII bytes.
I really don't think we should continue to allow mixed/undefined encodings. We should be strict about enforcing UTF-8, but of course we should provide helpful error messages to make it easy for people to make the corrections. So I think hackage should reject them with a suitable error message. I can send a patch. On that topic in fact, I think all parser warnings should be fatal errors as far as hackage is concerned. I'll send a separate patch for that. The more permissive hackage is, the more legacy of weird corner cases we accumulate. You suggested previously that we should add a warning for the cases where an isolated latin-1 char in someone's name turns out to be valid UTF-8 (but encoding for an unexpected char). I think that's a good idea. Obviously that'd want to be a non-fatal warning. Hmm, I now can't find the note where you made that suggestion. Can you give more details on how that check would work exactly? Duncan

On Thu, Mar 27, 2008 at 04:39:17PM +0000, Duncan Coutts wrote:
Can't we just reject them with the error message and ask people to fix the latin-1 sequences and re-upload using proper UTF-8?
The problem is that there are packages there now with .cabal files assuming Latin-1. Stopping more of them from getting in is fine, but we need to display the ones that are there correctly. Hmm, after considering a few schemes it's probably simplest to introduce strict enforcement on upload and retroactively patch the existing Latin-1 packages to UTF. Naughty, but a one-off.
You suggested previously that we should add a warning for the cases where an isolated latin-1 char in someone's name turns out to be valid UTF-8 (but encoding for an unexpected char). I think that's a good idea. Obviously that'd want to be a non-fatal warning. Hmm, I now can't find the note where you made that suggestion. Can you give more details on how that check would work exactly?
The common case is ASCII char, non-ASCII char, ASCII char. That's not a valid UTF-8 sequence, but fromUTF is erroneously accepting it. It needs to tighten up to keep these errors out. Incidentally, a UTF decoder is also supposed to reject non-minimal encodings, e.g. a 3-byte encoding for a Char that can be encoded in 2 bytes. That's to force canonical encodings for security.

In message <20080327172026.GB10636@soi.city.ac.uk> cabal-devel@haskell.org writes:
On Thu, Mar 27, 2008 at 04:39:17PM +0000, Duncan Coutts wrote:
Can't we just reject them with the error message and ask people to fix the latin-1 sequences and re-upload using proper UTF-8?
The problem is that there are packages there now with .cabal files assuming Latin-1. Stopping more of them from getting in is fine, but we need to display the ones that are there correctly.
Parsing them is essential, displaying them correctly is a bonus.
Hmm, after considering a few schemes it's probably simplest to introduce strict enforcement on upload and retroactively patch the existing Latin-1 packages to UTF. Naughty, but a one-off.
I'm quite happy for those to be fixed. The main point is that parsing the files does not fail, though the content for those fields would (or at least should) contain a Unicode replacement char.
You suggested previously that we should add a warning for the cases where an isolated latin-1 char in someone's name turns out to be valid UTF-8 (but encoding for an unexpected char). I think that's a good idea. Obviously that'd want to be a non-fatal warning. Hmm, I now can't find the note where you made that suggestion. Can you give more details on how that check would work exactly?
The common case is ASCII char, non-ASCII char, ASCII char. That's not a valid UTF-8 sequence, but fromUTF is erroneously accepting it. It needs to tighten up to keep these errors out.
Hmm. I'll replace the UTF decoder with the one from the utf8-string package (which is also BSD licensed).
Incidentally, a UTF decoder is also supposed to reject non-minimal encodings, e.g. a 3-byte encoding for a Char that can be encoded in 2 bytes. That's to force canonical encodings for security.
I believe the utf8-string version does that correctly. It detects over-long encodings specifically and makes them an invalid char. As I understand it that's so that it generates a single replacement char for non-minimal encodings rather than several. Duncan

On Thu, Mar 27, 2008 at 04:39:17PM +0000, Duncan Coutts wrote:
So I think hackage should reject them with a suitable error message. I can send a patch. On that topic in fact, I think all parser warnings should be fatal errors as far as hackage is concerned.
OK, parser warnings (including UTF decode errors) are now fatal. (When the decoder detects all the illegals, they'll be kept out.) Existing Latin-1 .cabal files have been transliterated to UTF-8.

In message <20080328122052.GA21271@soi.city.ac.uk> cabal-devel@haskell.org writes:
On Thu, Mar 27, 2008 at 04:39:17PM +0000, Duncan Coutts wrote:
So I think hackage should reject them with a suitable error message. I can send a patch. On that topic in fact, I think all parser warnings should be fatal errors as far as hackage is concerned.
OK, parser warnings (including UTF decode errors) are now fatal.
Great. Is that patch pushed to the repo? I don't see it yet. I've got a patch to update generally to the latest Cabal api, though if you've got the utf warnings form the latest cabal then have you had to make all those changes already?
(When the decoder detects all the illegals, they'll be kept out.)
Yep, working on that.
Existing Latin-1 .cabal files have been transliterated to UTF-8.
Fantastic, thanks. Duncan

On Fri, Mar 28, 2008 at 01:53:04PM +0000, Duncan Coutts wrote:
Is that patch pushed to the repo? I don't see it yet. I've got a patch to update generally to the latest Cabal api, though if you've got the utf warnings form the latest cabal then have you had to make all those changes already?
Sorry, I sometimes forget the two-stage process with darcs. Yes, I think it's all done.

In message <20080328140205.GA21593@soi.city.ac.uk> cabal-devel@haskell.org writes:
On Fri, Mar 28, 2008 at 01:53:04PM +0000, Duncan Coutts wrote:
Is that patch pushed to the repo? I don't see it yet. I've got a patch to update generally to the latest Cabal api, though if you've got the utf warnings form the latest cabal then have you had to make all those changes already?
Sorry, I sometimes forget the two-stage process with darcs. Yes, I think it's all done.
Got it. Thanks. :-) Duncan
participants (2)
-
Duncan Coutts
-
Ross Paterson