
On 2 November 2011 19:13, Ian Lynagh
[snip some stuff I didn't understand. I think I made the mistake of entering a Unicode discussion]
Sorry, perhaps that was too opaque! The problem is that if we commit to support occurrences of the private-use codepoint 0xEF80 then what happens if we: 1. Decode the UTF-32le data [0x80, 0xEF, 0x00, 0x00] to a string "\xEF80" 2. Pass the string "\xEF80" to a function that encodes it using an encoding which knows about the escaping mechanism. 3. Consequently encode "\xEF80" as [0x80] This seems a bit sad.
They are allowed to occur in Linux/ext2 filenames, anyway, and I think we ought to be able to handle them correctly if they do.
In Python, if a filename is decoded using UTF8 and the "surrogate escape" error handler, occurrences of lone surrogates are a decoding error because they are not allowed to occur in UTF-8 text. As a result the lone surrogate is put into the string escaped so it can be roundtripped back to a lone surrogate on output. So Python works OK. In GHC >= 7.2, if a filename is decoded using UTF8 and the "Roundtrip" error handler, occurrences of 0xEFNN are not a decoding error because they are perfectly fine Unicode codepoints. As a result they get put into the string unescaped, and so when we try to roundtrip the string we get the byte 0xNN in the output rather than the UTF-8 encoding of 0xEFNN. So GHC does not work OK in this situation :-( (The problem I outlined at the start of this email doesn't arise with the lone surrogate mechanism because surrogates aren't allowed in UTF-32 text either. So step 1 in the process would have failed with a decoding error.) Hope that helps, Max