
On 09/11/2011 10:39, Max Bolingbroke wrote:
On 8 November 2011 11:43, Simon Marlow wrote:
Don't you mean 1 is what we have?
Yes, sorry!
Failing to roundtrip in some cases, and doing so silently, seems highly suboptimal to me. I'm sorry I didn't pick up on this at the time (Unicode is a swamp :).
I *can* change the implementation back to using lone surrogates. This gives us guaranteed roundtripping but it means that the user might see lone-surrogate Char values in Strings from the filesystem/command line. IIRC this does break some software -- e.g. Brian's "text" library explicitly checks for such characters and fails if it detects them.
So whatever happens we are going to end up making some group of users unhappy!

* No PEP383: Haskellers using non-ASCII get upset when their command line argument [String]s aren't in fact sequences of characters, but sequences of bytes in some arbitrary encoding
* PEP383(surrogates): Unicoders get upset by lone surrogates (which can actually occur at the moment, independent of PEP383 -- e.g. as character literals or from FFI)
* PEP383(private chars): Unixers get upset that we can't roundtrip byte sequences that look like the codepoint 0xEFXX encoded in the current locale. In practice, 0xEFXX is only decodable from a UTF encoding, so we fail to roundtrip byte sequences like the one Ian posted.
I'm happy to implement any behaviour, I would just like to know that whatever it is is accepted as the correct tradeoff :-)
I would be happy with the surrogate approach, I think. Arguably, if you try to treat a string with lone surrogates as Unicode and it fails, then that is a feature: the original string wasn't Unicode. All you can do with an invalid Unicode string is use it as a FilePath again, and the right thing will happen.

Alternatively, if we stick with the private char approach, it should be possible to have an escaping scheme for 0xEFxx characters in the input that would enable us to roundtrip correctly. That is, escape 0xEFxx into a sequence 0xYYEF 0xYYxx for some suitable YY. But perhaps that would be too expensive: it means an extra translation pass over the buffer after iconv (well, we do this for newline translation, so maybe it's not too bad).
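To make the escaping idea concrete, here is a rough sketch in Python (used only because the thread already refers to Python's handling). It is a variant of the scheme above, not the exact 0xYYEF 0xYYxx encoding: it assumes UTF-8 as the locale encoding and uses a single marker character, U+EF00, in front of any genuine 0xEFxx character. The names `ESC`, `decode_private` and `encode_private` are made up for illustration.

```python
# Hypothetical escape marker. Byte 0x00 always decodes as NUL, so U+EF00
# is never produced as the stand-in for an undecodable byte.
ESC = "\uef00"

def decode_private(raw: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte b to U+EF00 + b."""
    out = []
    # surrogateescape is used here only to locate the undecodable bytes.
    for ch in raw.decode("utf-8", "surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Undecodable byte 0x80..0xFF becomes U+EF80..U+EFFF.
            out.append(chr(0xEF00 + (cp - 0xDC00)))
        elif 0xEF00 <= cp <= 0xEFFF:
            # Genuine private-use character from the input: escape it.
            out.append(ESC + ch)
        else:
            out.append(ch)
    return "".join(out)

def encode_private(text: str) -> bytes:
    """Inverse of decode_private (the extra pass over the buffer)."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch == ESC:
            # The character after the marker is a literal 0xEFxx character.
            out += next(chars).encode("utf-8")
        elif 0xEF80 <= ord(ch) <= 0xEFFF:
            # Stand-in for a raw byte: emit that byte unchanged.
            out.append(ord(ch) - 0xEF00)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

The cost is what you would expect: one additional linear pass in each direction, and strings containing genuine 0xEFxx characters grow by one character per occurrence.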
RE exposing a ByteString based interface to the IO library from base/unix/whatever: AFAIK Python doesn't do this, and just tells people to use the (x.encode(sys.getfilesystemencoding(), "surrogateescape")) escape hatch, which is what I've been recommending. I think this would be more satisfying to John if it were actually guaranteed to work on arbitrary byte sequences, not just *highly likely* to work :-)
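For reference, the Python escape hatch mentioned above behaves roughly as follows. This is a small illustration assuming UTF-8 as the filesystem encoding (real code would use `sys.getfilesystemencoding()`); the helper names `decode_arg`, `encode_arg` and `is_valid_unicode` are made up for the example.

```python
def decode_arg(raw: bytes) -> str:
    # PEP 383: each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
    return raw.decode("utf-8", "surrogateescape")

def encode_arg(text: str) -> bytes:
    # The escape hatch: lone surrogates are turned back into the raw bytes.
    return text.encode("utf-8", "surrogateescape")

def is_valid_unicode(text: str) -> bool:
    # A strict encode rejects lone surrogates, which is how a library
    # that insists on real Unicode would notice the escaped bytes.
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

raw = b"caf\xe9.txt"          # Latin-1 bytes, not valid UTF-8
name = decode_arg(raw)        # contains the lone surrogate U+DCE9
assert encode_arg(name) == raw
assert not is_valid_unicode(name)
```

With the surrogate approach the roundtrip holds for any byte sequence, because the strict UTF-8 decoder never produces a surrogate from valid input.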
The performance overhead of all this worries me. withCString has taken a huge performance hit, and I think there are people who want to know that there aren't several complex encoding/decoding passes between their Haskell code and the POSIX API. We ought to be able to program to POSIX directly, and the same goes for Win32.

Cheers,
Simon