
On 2 November 2011 16:29, Ian Lynagh
If I understand correctly, you use U+EF00-U+EFFF to encode the characters 0-255 when they are not a valid part of the UTF8 stream.
Yes.
So why not encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, and so on? Doesn't it then become completely reversible?
This was also suggested by Mark Lentczner at the time I wrote the patch, but I raised a few objections (reproduced below):

"""
This would require us to:

1. Unconditionally decode these byte sequences using the escape mechanism, even if using a non-roundtripping encoding. This is because the chars that result might be fed back into a roundtripping encoding, where they would otherwise get confused with escapes representing some other bytes.

2. Unconditionally decode these particular characters from escapes, even if using a non-roundtripping decoding -- necessary because of 1.

Both of which are a little annoying. Perhaps more seriously, it would play badly with e.g. reading in UTF-8 and writing out UTF-16, because your UTF-16 would have bits of UTF-8 representing these private-use chars embedded within it.
"""

So although this approach is somewhat attractive, I'm not sure the benefits of complete roundtripping outweigh the costs. This is why the unmodified PEP 383 approach is kind of nice: it uses lone surrogate (rather than private-use) codepoints to do the escaping, and these codepoints are simply not allowed to occur in valid UTF-encoded text.

Max
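For anyone who wants to poke at the trade-off concretely, here is a rough Python sketch of the two private-use schemes discussed above. It is not the GHC patch: the function names (decode_naive, decode_reversible, encode_escaped) are made up for illustration, and the sketch leans on Python's built-in "surrogateescape" error handler (PEP 383) only as a convenient way to find the undecodable bytes.

```python
ESC_BASE = 0xEF00  # private-use range U+EF00..U+EFFF, as in the thread


def _is_escape(ch: str) -> bool:
    return ESC_BASE <= ord(ch) <= ESC_BASE + 0xFF


def decode_naive(data: bytes) -> str:
    """Invalid byte b becomes U+EF00+b; valid text passes through untouched."""
    out = []
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:  # surrogateescape marked an invalid byte
            out.append(chr(ESC_BASE + (cp - 0xDC00)))
        else:
            out.append(ch)
    return "".join(out)


def decode_reversible(data: bytes) -> str:
    """As decode_naive, but genuine U+EFxx chars are also escaped bytewise."""
    out = []
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(chr(ESC_BASE + (cp - 0xDC00)))
        elif _is_escape(ch):
            # Escape each UTF-8 byte of the genuine private-use character.
            out.extend(chr(ESC_BASE + b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Turn every U+EFxx escape back into the single byte it stands for."""
    out = bytearray()
    for ch in text:
        if _is_escape(ch):
            out.append(ord(ch) - ESC_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


genuine = "\uef80".encode("utf-8")  # valid UTF-8 for a real U+EF80
raw = b"\x80 " + genuine            # one stray byte, then the genuine char

naive_ok = encode_escaped(decode_naive(raw)) == raw
reversible_ok = encode_escaped(decode_reversible(raw)) == raw
leaked = decode_reversible(genuine)
pep383_ok = (
    raw.decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")
    == raw
)
```

With the naive scheme the stray byte 0x80 and the genuine U+EF80 both decode to the same character, so the roundtrip loses information. The bytewise variant does roundtrip, but `leaked` shows the cost: a perfectly valid character comes out as three escape characters, which is exactly what would end up embedded in UTF-16 output. The surrogate-based roundtrip has neither problem in this example, because a lone surrogate cannot come from valid UTF-8 input.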