
On 8 November 2011 11:43, Simon Marlow wrote:
Don't you mean 1 is what we have?
Yes, sorry!
Failing to roundtrip in some cases, and doing so silently, seems highly suboptimal to me. I'm sorry I didn't pick up on this at the time (Unicode is a swamp :).
I *can* change the implementation back to using lone surrogates. This gives us guaranteed roundtripping but it means that the user might see lone-surrogate Char values in Strings from the filesystem/command line. IIRC this does break some software -- e.g. Brian's "text" library explicitly checks for such characters and fails if it detects them.

So whatever happens we are going to end up making some group of users unhappy!

 * No PEP383: Haskellers using non-ASCII get upset when their command line argument [String]s aren't in fact sequences of characters, but sequences of bytes in some arbitrary encoding

 * PEP383(surrogates): Unicoders get upset by lone surrogates (which can actually occur at the moment, independent of PEP383 -- e.g. as character literals or from FFI)

 * PEP383(private chars): Unixers get upset that we can't roundtrip byte sequences that look like the codepoint 0xEFXX encoded in the current locale. In practice, 0xEFXX is only decodable from a UTF encoding, so we fail to roundtrip byte sequences like the one Ian posted.

I'm happy to implement any behaviour, I would just like to know that whatever it is is accepted as the correct tradeoff :-)

RE exposing a ByteString based interface to the IO library from base/unix/whatever: AFAIK Python doesn't do this, and just tells people to use the (x.encode(sys.getfilesystemencoding(), "surrogateescape")) escape hatch, which is what I've been recommending. I think this would be more satisfying to John if it were actually guaranteed to work on arbitrary byte sequences, not just *highly likely* to work :-)

Max
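
For illustration only, here is a rough Python sketch of the tradeoff between the two PEP383 variants discussed above. The first half uses Python's real "surrogateescape" error handler. The second half is a hypothetical model of a private-char scheme: the helpers decode_private and encode_private, the sample byte strings, and the exact range 0xEF80-0xEFFF are assumptions made for this example, not the actual GHC implementation.

```python
# Arbitrary example input: Latin-1 bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# PEP 383 (surrogates): an undecodable byte 0xNN becomes lone surrogate U+DCNN.
name = raw.decode("utf-8", "surrogateescape")

# The escape hatch: encoding with the same handler restores the original bytes.
roundtripped = name.encode("utf-8", "surrogateescape")

# The cost: the string now holds a lone surrogate, which strict encoders reject.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False


# Hypothetical private-char variant: escape byte 0xNN as U+EFNN instead.
def decode_private(data: bytes) -> str:
    s = data.decode("utf-8", "surrogateescape")
    return "".join(
        chr(ord(c) - 0xDC00 + 0xEF00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in s
    )


def encode_private(text: str) -> bytes:
    # Every U+EFNN is treated as an escaped byte, even if it was a real character.
    s = "".join(
        chr(ord(c) - 0xEF00 + 0xDC00) if 0xEF80 <= ord(c) <= 0xEFFF else c
        for c in text
    )
    return s.encode("utf-8", "surrogateescape")


# Bytes that legitimately encode U+EF80 in UTF-8 collide with the escape range.
genuine = "\uef80".encode("utf-8")
collapsed = encode_private(decode_private(genuine))
```

In this sketch, roundtripped equals raw, so the surrogate scheme preserves arbitrary bytes at the price of lone surrogates in the string. The private-char model never produces surrogates, but collapsed differs from genuine, which is the silent roundtrip failure described in the third bullet.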