Proposal #3455: Add a setting to change how Unicode encoding errors are handled

I propose that we augment ghc-6.12.1's support for Unicode Handles by adding the following functions to System.IO:

    hSetOnEncodingError :: Handle -> OnEncodingError -> IO ()
    hGetOnEncodingError :: Handle -> IO OnEncodingError

as well as the enumeration `OnEncodingError` with three constructors:

- `ThrowEncodingError`: Throw an exception at the first encoding or decoding error.
- `SkipEncodingError`: Skip all invalid bytes or characters.
- `TranslitEncodingError`: Replace undecodable bytes with U+FFFD, and unencodable characters with '?'.

I have implemented this functionality in a patch attached to the ticket. Haddock docs are here:
http://code.haskell.org/~judah/new-io-docs/System-IO.html#23

The choice of error handler is orthogonal to the choice of encoder. Additionally, the same setting is used for both read and write modes. For portability, the handlers are written in pure Haskell rather than using GNU iconv's //TRANSLIT feature.

Note that the text package, for example, provides more sophisticated error-handling options. However, I think the above choices are useful enough without making the API too complicated.

Discussion deadline: September 9
Ticket: http://hackage.haskell.org/trac/ghc/ticket/3455

Best,
-Judah
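To make the three policies concrete, here is a minimal pure sketch. It is not the attached patch: the constructor names come from the proposal, but `encodeAscii` and `decodeAscii` are hypothetical helpers, 7-bit ASCII is chosen only to keep the example short, and errors are reported with `Either` instead of an exception so the functions stay pure.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- The enumeration from the proposal. Everything below it is a hypothetical
-- illustration, not the code in the patch.
data OnEncodingError
  = ThrowEncodingError
  | SkipEncodingError
  | TranslitEncodingError
  deriving (Eq, Show)

-- Encode to 7-bit ASCII. A Left result stands in for the thrown exception.
encodeAscii :: OnEncodingError -> String -> Either String [Word8]
encodeAscii policy = go
  where
    go [] = Right []
    go (c:cs)
      | ord c < 0x80 = (fromIntegral (ord c) :) <$> go cs
      | otherwise = case policy of
          ThrowEncodingError    -> Left ("cannot encode " ++ show c)
          SkipEncodingError     -> go cs                -- drop the character
          TranslitEncodingError -> (0x3F :) <$> go cs   -- substitute '?'

-- Decode from 7-bit ASCII with the same three policies.
decodeAscii :: OnEncodingError -> [Word8] -> Either String String
decodeAscii policy = go
  where
    go [] = Right []
    go (b:bs)
      | b < 0x80 = (chr (fromIntegral b) :) <$> go bs
      | otherwise = case policy of
          ThrowEncodingError    -> Left ("cannot decode byte " ++ show b)
          SkipEncodingError     -> go bs                  -- drop the byte
          TranslitEncodingError -> ('\xFFFD' :) <$> go bs -- substitute U+FFFD
```

Because the policy is an ordinary argument, it stays independent of the codec, which mirrors the claim that the error handler is orthogonal to the choice of encoder.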

On Sun, 2009-08-23 at 09:22 -0700, Judah Jacobson wrote:
> I propose that we augment ghc-6.12.1's support for Unicode Handles by adding the following functions to System.IO:
>
>     hSetOnEncodingError :: Handle -> OnEncodingError -> IO ()
>     hGetOnEncodingError :: Handle -> IO OnEncodingError

I agree that it is important.

> Note that the text package, for example, provides more sophisticated error-handling options. However, I think the above choices are useful enough without making the API too complicated.

Personally I would prefer we postpone this decision with the aim that we unify the Text encoding / decoding between the IO system and text package.

Specifically I suggest we put this decision off until after ICFP where we hope that the authors of ghc's new text IO system and the authors of the text package can get together with other interested individuals to discuss some more unified system.

It should be possible to make an encoder abstraction that can be used purely in the text package for conversion between ByteString <-> Text, and also used in the IO system for text mode Handles. How to handle encoding and translation errors would have to be part of the design of that encoder abstraction.

My impression is that an ST version of the current text encoder abstraction in the GHC text IO system would also be usable for pure conversions in the text package.

Duncan
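As a rough illustration of the idea that one ST-based codec could serve both pure conversions and Handle I/O, the sketch below writes a toy decoder in `ST` and runs it both ways. The names `decodeLatin1ST`, `decodePure`, and `decodeIO` are hypothetical, Latin-1 and plain lists are assumptions made for brevity, and this is not GHC's actual codec interface.

```haskell
import Control.Monad.ST (ST, runST, stToIO)
import Data.Char (chr)
import Data.STRef (modifySTRef', newSTRef, readSTRef)
import Data.Word (Word8)

-- A toy codec written against ST: it accumulates decoded characters in a
-- mutable reference, standing in for a real buffer-to-buffer decoder.
decodeLatin1ST :: [Word8] -> ST s String
decodeLatin1ST bytes = do
  out <- newSTRef []
  mapM_ (\b -> modifySTRef' out (chr (fromIntegral b) :)) bytes
  reverse <$> readSTRef out

-- Pure use, as a ByteString <-> Text style conversion would need.
decodePure :: [Word8] -> String
decodePure bytes = runST (decodeLatin1ST bytes)

-- The same codec run inside IO, as a text-mode Handle would need.
decodeIO :: [Word8] -> IO String
decodeIO bytes = stToIO (decodeLatin1ST bytes)
```

The point of the sketch is only that the codec body is written once; an error-handling policy would then have to be a parameter of that shared body.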

On 23/08/2009 17:22, Judah Jacobson wrote:
> I propose that we augment ghc-6.12.1's support for Unicode Handles by adding the following functions to System.IO:
>
>     hSetOnEncodingError :: Handle -> OnEncodingError -> IO ()
>     hGetOnEncodingError :: Handle -> IO OnEncodingError
>
> as well as the enumeration `OnEncodingError` with three constructors:
>
> - `ThrowEncodingError`: Throw an exception at the first encoding or decoding error.
> - `SkipEncodingError`: Skip all invalid bytes or characters.
> - `TranslitEncodingError`: Replace undecodable bytes with U+FFFD, and unencodable characters with '?'.
>
> I have implemented this functionality in a patch attached to the ticket. Haddock docs are here:
> http://code.haskell.org/~judah/new-io-docs/System-IO.html#23
>
> The choice of error handler is orthogonal to the choice of encoder. Additionally, the same setting is used for both read and write modes. For portability, the handlers are written in pure Haskell rather than using GNU iconv's //TRANSLIT feature.
>
> Note that the text package, for example, provides more sophisticated error-handling options. However, I think the above choices are useful enough without making the API too complicated.

I replied on the ticket, reproduced here for readers of libraries@:

It looks like the main question here is whether the IOError should be returned explicitly (as in your patch), or whether we should just catch the exception. All things being equal, catching the exception would be simpler, as it wouldn't require any changes in the codecs. Is there a reason why you didn't do it that way?

Perhaps because you want to be sure that the exception is really an encoding error, and not some other kind of exception? If that's the case, then we should introduce a new exception for encoding errors (that's probably a good idea anyway).

Cheers,
Simon

On Tue, Aug 25, 2009 at 5:10 AM, Simon Marlow wrote:
> I replied on the ticket, reproduced here for readers of libraries@:
>
> It looks like the main question here is whether the IOError should be returned explicitly (as in your patch), or whether we should just catch the exception. All things being equal, catching the exception would be simpler, as it wouldn't require any changes in the codecs. Is there a reason why you didn't do it that way? Perhaps because you want to be sure that the exception is really an encoding error, and not some other kind of exception? If that's the case, then we should introduce a new exception for encoding errors (that's probably a good idea anyway).

I agree that we should create a new exception type. Given the errors currently thrown by the library, I assume that it doesn't need to be anything more than a newtype wrapping a String message.

If the text package and ghc's IO library are merged into a new system, then it would probably be better to explicitly return the error -- that way we can have pure ByteString <-> Text conversion functions. But for the current state of the library (where the encoding type is only exposed under GHC.* and makes few stability promises) it probably doesn't make a big difference.

-Judah
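The two styles discussed here can be put side by side in a short sketch. It assumes a hypothetical `EncodingError` newtype wrapping a String, as suggested above; the helper names and the ASCII codec are made up for illustration and do not come from the patch.

```haskell
import Control.Exception (Exception, throwIO, try)
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical dedicated exception type: a newtype around a message.
newtype EncodingError = EncodingError String
  deriving Show

instance Exception EncodingError

-- Style 1: return the error explicitly, which keeps the conversion pure.
encodeAsciiEither :: String -> Either EncodingError [Word8]
encodeAsciiEither = traverse step
  where
    step c
      | ord c < 0x80 = Right (fromIntegral (ord c))
      | otherwise    = Left (EncodingError ("cannot encode " ++ show c))

-- Style 2: throw the dedicated exception from IO code.
encodeAsciiIO :: String -> IO [Word8]
encodeAsciiIO = either throwIO pure . encodeAsciiEither

-- Catching at the EncodingError type lets any other exception propagate,
-- which addresses the worry about mistaking unrelated failures for
-- encoding errors.
tryEncode :: String -> IO (Either EncodingError [Word8])
tryEncode = try . encodeAsciiIO
```

Writing the pure version first and deriving the throwing version from it is one way to keep both options open if the libraries are later unified.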

On Mon, Aug 31, 2009 at 7:29 PM, Judah Jacobson wrote:
> On Tue, Aug 25, 2009 at 5:10 AM, Simon Marlow wrote:
>> On 23/08/2009 17:22, Judah Jacobson wrote:
>>> I propose that we augment ghc-6.12.1's support for Unicode Handles by adding the following functions to System.IO:
>>>
>>>     hSetOnEncodingError :: Handle -> OnEncodingError -> IO ()
>>>     hGetOnEncodingError :: Handle -> IO OnEncodingError
>>>
>>> as well as the enumeration `OnEncodingError` with three constructors:
>>>
>>> - `ThrowEncodingError`: Throw an exception at the first encoding or decoding error.
>>> - `SkipEncodingError`: Skip all invalid bytes or characters.
>>> - `TranslitEncodingError`: Replace undecodable bytes with U+FFFD, and unencodable characters with '?'.

As a brief, possibly irrelevant aside:

There is one other option for how to handle Unicode en/decoding errors that I've used and seen used. It is the basis of Markus Kuhn's "UTF-8B" encoding, whereby bytes that fail to parse are read as 0xdc00 + the raw byte; when you go to emit those code points, you can write them directly into the stream as raw bytes. This permits a perfect round trip from UTF-8 to String to UTF-8, regardless of encoding errors.

The codepoints from 0xdc80-0xdcff don't conflict with UTF-16, because they are in the unmapped d800-dfff range, and ISO 10646-1 section R.4 notes that the mapping of those code positions in UTF-8 is undefined, so an implementation is free to do with them as it pleases.

The main good thing that comes with this representation is that no information is discarded. It doesn't hurt that this also sidesteps the other uses of the d800-dfff range like the illegal Oracle-style "CESU-8" encoding of surrogate pairs, etc.

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
http://bsittler.livejournal.com/10381.html

-Edward Kmett
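The round trip described above can be sketched as follows. This is a simplified illustration written for this thread, not Markus Kuhn's reference code: `decodeUtf8B` and `encodeUtf8B` are hypothetical names, lists of `Word8` stand in for byte buffers, and every byte that does not start a well-formed UTF-8 sequence is escaped on its own.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode UTF-8, mapping each undecodable byte b to the code point 0xDC00 + b.
decodeUtf8B :: [Word8] -> String
decodeUtf8B [] = []
decodeUtf8B (b:bs) =
  case decodeOne b bs of
    Just (c, rest) -> c : decodeUtf8B rest
    Nothing        -> chr (0xDC00 + fromIntegral b) : decodeUtf8B bs

-- Try to read one well-formed sequence starting at lead byte b. Overlong
-- forms, surrogates, and values above U+10FFFF are rejected, so anything
-- accepted here re-encodes to exactly the same bytes.
decodeOne :: Word8 -> [Word8] -> Maybe (Char, [Word8])
decodeOne b bs
  | b < 0x80              = Just (chr (fromIntegral b), bs)
  | b >= 0xC2 && b < 0xE0 = multi 1 0x80    (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b < 0xF0 = multi 2 0x800   (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b < 0xF5 = multi 3 0x10000 (fromIntegral b .&. 0x07)
  | otherwise             = Nothing
  where
    multi n minCp lead =
      let (conts, rest) = splitAt n bs
          cp = foldl (\acc x -> (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F)) lead conts
      in if length conts == n && all isCont conts
            && cp >= minCp && cp <= 0x10FFFF
            && not (cp >= 0xD800 && cp <= 0xDFFF)
           then Just (chr cp, rest)
           else Nothing
    isCont x = x .&. 0xC0 == 0x80

-- Encode to UTF-8, emitting code points 0xDC80-0xDCFF as the raw bytes they
-- stand for. Other surrogates are not treated specially in this sketch.
encodeUtf8B :: String -> [Word8]
encodeUtf8B = concatMap enc
  where
    enc c
      | cp >= 0xDC80 && cp <= 0xDCFF = [fromIntegral (cp - 0xDC00)]
      | cp < 0x80    = [fromIntegral cp]
      | cp < 0x800   = [0xC0 .|. hi 6, cont 0]
      | cp < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise    = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        cp     = ord c
        hi s   = fromIntegral (cp `shiftR` s)
        cont s = 0x80 .|. (fromIntegral (cp `shiftR` s) .&. 0x3F)
```

Under these assumptions, `encodeUtf8B (decodeUtf8B bytes)` gives back the original `bytes` for any input, which is the property the aside is pointing at: a fourth policy that discards nothing.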
participants (4)
- Duncan Coutts
- Edward Kmett
- Judah Jacobson
- Simon Marlow