
On Mon, Aug 31, 2009 at 7:29 PM, Judah Jacobson wrote:
On Tue, Aug 25, 2009 at 5:10 AM, Simon Marlow wrote:
On 23/08/2009 17:22, Judah Jacobson wrote:
I propose that we augment ghc-6.12.1's support for Unicode Handles by adding the following functions to System.IO:
    hSetOnEncodingError :: Handle -> OnEncodingError -> IO ()
    hGetOnEncodingError :: Handle -> IO OnEncodingError
as well as the enumeration `OnEncodingError` with three constructors:
- `ThrowEncodingError`: Throw an exception at the first encoding or decoding error.
- `SkipEncodingError`: Skip all invalid bytes or characters.
- `TranslitEncodingError`: Replace undecodable bytes with U+FFFD, and unencodable characters with '?'.
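To make the three policies concrete, here is a minimal pure sketch. Only the `OnEncodingError` constructor names come from the proposal; the helper `decodeAsciiWith` is hypothetical, and it uses a 7-bit ASCII decoder as a stand-in for a real codec rather than anything tied to Handles.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Constructor names are taken from the proposal above.
data OnEncodingError
  = ThrowEncodingError
  | SkipEncodingError
  | TranslitEncodingError
  deriving (Eq, Show)

-- Hypothetical helper: decode bytes as 7-bit ASCII under a given policy.
-- A Left result stands in for the exception and carries the first
-- offending byte.
decodeAsciiWith :: OnEncodingError -> [Word8] -> Either Word8 String
decodeAsciiWith policy = go
  where
    go [] = Right []
    go (b : bs)
      | b < 0x80 = (chr (fromIntegral b) :) <$> go bs
      | otherwise = case policy of
          ThrowEncodingError    -> Left b                    -- stop at first error
          SkipEncodingError     -> go bs                     -- drop the byte
          TranslitEncodingError -> ('\xFFFD' :) <$> go bs    -- substitute U+FFFD
```

Note that both the skip and the transliterate policies lose information: the original byte cannot be recovered from the decoded String.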
As a brief, possibly irrelevant aside: there is one other option for handling Unicode encoding and decoding errors that I've used and seen used. It is the basis of Markus Kuhn's "UTF-8B" encoding, whereby each byte that fails to parse is read as the code point 0xdc00 + the raw byte. When you go to emit such a code point, you write the original byte directly into the stream. This permits a perfect round trip from UTF-8 to String to UTF-8, regardless of encoding errors.

The code points 0xdc80-0xdcff don't conflict with UTF-16, because they are in the unmapped d800-dfff range, and ISO 10646-1 section R.4 notes that the mapping of those code positions in UTF-8 is undefined, so an implementation is free to do with them as it pleases.

The main benefit of this representation is that no information is discarded. It doesn't hurt that it also sidesteps the other uses of the d800-dfff range, such as the illegal Oracle-style "CESU-8" encoding of surrogate pairs.

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
http://bsittler.livejournal.com/10381.html

-Edward Kmett
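The UTF-8B round trip described above can be sketched roughly as follows. This is an illustration over plain byte lists, not a proposed System.IO implementation; the names `escapeByte`, `decodeUtf8b` and `encodeUtf8b` are hypothetical, and the decoder assumes a strict reading of UTF-8 (overlong forms and encoded surrogates are treated as errors and escaped byte by byte).

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Map an undecodable byte to a code point in 0xdc80-0xdcff.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xDC00 + fromIntegral b)

-- Decode UTF-8, escaping every byte that is not part of a valid sequence.
decodeUtf8b :: [Word8] -> String
decodeUtf8b [] = []
decodeUtf8b (b : bs)
  | b < 0x80               = chr (fromIntegral b) : decodeUtf8b bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise              = escapeByte b : decodeUtf8b bs
  where
    -- n continuation bytes, payload bits of the lead byte, smallest
    -- code point allowed for this sequence length.
    multi :: Int -> Int -> Int -> String
    multi n lead minCp =
      let (conts, rest) = splitAt n bs
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) lead conts
          ok = length conts == n
               && all (\c -> c .&. 0xC0 == 0x80) conts
               && cp >= minCp && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok
           then chr cp : decodeUtf8b rest
           else escapeByte b : decodeUtf8b bs  -- escape only the lead byte, then resume

-- Encode to UTF-8, writing escaped code points back as their raw bytes.
encodeUtf8b :: String -> [Word8]
encodeUtf8b = concatMap enc
  where
    enc ch
      | cp >= 0xDC80 && cp <= 0xDCFF = [fromIntegral (cp - 0xDC00)]
      | cp < 0x80    = [fromIntegral cp]
      | cp < 0x800   = [0xC0 .|. hi 6, cont 0]
      | cp < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise    = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        cp = ord ch
        hi, cont :: Int -> Word8
        hi s   = fromIntegral (cp `shiftR` s)
        cont s = 0x80 .|. (fromIntegral (cp `shiftR` s) .&. 0x3F)
```

Because the decoder rejects encoded surrogates, the only way a code point in 0xdc80-0xdcff can appear in its output is through `escapeByte`, which is what lets `encodeUtf8b (decodeUtf8b bytes)` return the original bytes.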