Re: Proposal #3455: Add a setting to change how Unicode encoding errors are handled

31 Aug 2009

      On Mon, Aug 31, 2009 at 7:29 PM, Judah Jacobson wrote:
...
On Tue, Aug 25, 2009 at 5:10 AM, Simon Marlow wrote:
...
On 23/08/2009 17:22, Judah Jacobson wrote:
...
I propose that we augment ghc-6.12.1's support for Unicode Handles
by adding the following functions to System.IO:
hSetOnEncodingError :: Handle -> OnEncodingError -> IO ()
hGetOnEncodingError :: Handle -> IO OnEncodingError
as well as the enumeration `OnEncodingError` with three constructors:
- `ThrowEncodingError`: Throw an exception at the first encoding or
  decoding error.
- `SkipEncodingError`: Skip all invalid bytes or characters.
- `TranslitEncodingError`: Replace undecodable bytes with U+FFFD, and
  unencodable characters with '?'.
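
To make the three behaviours concrete, here is a minimal pure sketch of the
encoding side only. It is a model, not the proposed System.IO functions:
`encodeLatin1` is a hypothetical helper, Latin-1 is an arbitrary choice of
target encoding, and a `Left` result stands in for the thrown exception.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical pure model of the proposed enumeration; this is not
-- an existing System.IO type.
data OnEncodingError
  = ThrowEncodingError
  | SkipEncodingError
  | TranslitEncodingError
  deriving (Eq, Show)

-- Encode a String to Latin-1 bytes under the given policy.
-- Left carries the first unencodable character (standing in for an
-- exception); Right carries the encoded bytes.
encodeLatin1 :: OnEncodingError -> String -> Either Char [Word8]
encodeLatin1 policy = go
  where
    go [] = Right []
    go (c : cs)
      | ord c <= 0xFF = (fromIntegral (ord c) :) <$> go cs
      | otherwise = case policy of
          ThrowEncodingError    -> Left c               -- stop at first error
          SkipEncodingError     -> go cs                -- drop the character
          TranslitEncodingError -> (0x3F :) <$> go cs   -- 0x3F is '?'
```

Loaded in GHCi, `encodeLatin1 SkipEncodingError "a\955b"` drops the lambda,
while the `TranslitEncodingError` policy substitutes a '?' byte for it.
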
As a brief, possibly irrelevant aside:

There is one other option for how to handle Unicode en/decoding errors that
I've used and seen used.
It is the basis of Markus Kuhn's "UTF-8B" encoding, whereby each byte that
fails to parse is read as 0xdc00 + the raw byte; when you later emit those
code points, you write the original raw bytes directly into the stream.
This permits a perfect round trip from UTF-8 to String to UTF-8,
regardless of encoding errors. The code points 0xdc80-0xdcff don't
conflict with UTF-16, because they are in the unmapped d800-dfff range,
and ISO 10646-1 section R.4 notes that the mapping of those code positions
in UTF-8 is undefined, so an implementation is free to do with them as it
pleases. The main benefit of this representation is that no information is
discarded. It doesn't hurt that this also sidesteps the other uses of the
d800-dfff range, like the illegal Oracle-style "CESU-8" encoding of
surrogate pairs, etc.
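
A minimal sketch of that round trip over plain byte lists, to show the
idea rather than any existing library: `decodeUtf8b` and `encodeUtf8b` are
hypothetical names, and I am assuming the strict reading where overlong
forms and encoded surrogates also count as parse errors.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode UTF-8; any byte that is not part of a valid sequence becomes
-- the code point 0xDC00 + byte, which always lands in 0xDC80..0xDCFF.
decodeUtf8b :: [Word8] -> String
decodeUtf8b [] = []
decodeUtf8b (b : bs)
  | b < 0x80               = chr (fromIntegral b) : decodeUtf8b bs
  | b >= 0xC2 && b <= 0xDF = multi 1 0x80    (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b <= 0xEF = multi 2 0x800   (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b <= 0xF4 = multi 3 0x10000 (fromIntegral b .&. 0x07)
  | otherwise              = escape
  where
    -- Escape only the lead byte, then resume decoding right after it.
    escape = chr (0xDC00 + fromIntegral b) : decodeUtf8b bs
    isCont c = c .&. 0xC0 == 0x80
    -- n continuation bytes expected; lo is the smallest code point that
    -- legitimately needs this length (rejects overlong forms).
    multi n lo hiBits =
      let (cont, rest) = splitAt n bs
          step acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)
          cp = foldl step hiBits cont
          valid = length cont == n
                  && all isCont cont
                  && cp >= lo && cp <= 0x10FFFF
                  && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if valid then chr cp : decodeUtf8b rest else escape

-- Encode to UTF-8, writing escaped code points back as their raw byte.
-- Other lone surrogates are not special-cased in this sketch.
encodeUtf8b :: String -> [Word8]
encodeUtf8b = concatMap enc
  where
    enc ch
      | c >= 0xDC80 && c <= 0xDCFF = [fromIntegral (c - 0xDC00)]
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [0xC0 .|. hi 6, cont 0]
      | c < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        c = ord ch
        hi s = fromIntegral (c `shiftR` s)
        cont s = 0x80 .|. (fromIntegral (c `shiftR` s) .&. 0x3F)
```

With these definitions, `encodeUtf8b (decodeUtf8b bytes)` should give back
`bytes` for any input, since valid sequences re-encode to themselves and
every invalid byte is carried through as its own escaped code point.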

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
http://bsittler.livejournal.com/10381.html

-Edward Kmett