
On Wed, Nov 02, 2011 at 07:59:21PM +0000, Max Bolingbroke wrote:
On 2 November 2011 19:13, Ian Lynagh wrote:
They are allowed to occur in Linux/ext2 filenames, anyway, and I think we ought to be able to handle them correctly if they do.
In Python, if a filename is decoded using UTF-8 and the "surrogateescape" error handler, occurrences of lone surrogates are a decoding error because they are not allowed to occur in UTF-8 text. As a result the bytes of the lone surrogate are put into the string escaped, so they can be roundtripped back to the same lone surrogate on output. So Python works OK.
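To make the Python case concrete, here is a minimal sketch using only the standard `bytes.decode` / `str.encode` calls with `errors="surrogateescape"`. The filename bytes are made up for illustration; they contain the UTF-8-style encoding of the lone surrogate U+DC80, which is not valid UTF-8.

```python
# Hypothetical filename bytes: "foo", then ED B2 80 (the UTF-8-style
# encoding of the lone surrogate U+DC80), then ".txt".
raw = b"foo\xed\xb2\x80.txt"

# Strict UTF-8 decoding rejects the encoded surrogate.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(c)) for c in name[3:6]]

# Encoding with the same handler turns each U+DCNN back into the byte 0xNN.
roundtripped = name.encode("utf-8", errors="surrogateescape")

print(strict_ok, escaped, roundtripped == raw)
```

The escape characters can never come out of a successful decode of valid UTF-8, so there is no ambiguity when encoding the string back to bytes.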
In GHC >= 7.2, if a filename is decoded using UTF-8 and the "Roundtrip" error handler, occurrences of the codepoint 0xEFNN are not a decoding error because they are perfectly fine Unicode codepoints. As a result they get put into the string unescaped, and so when we try to roundtrip the string we get the byte 0xNN in the output rather than the UTF-8 encoding of 0xEFNN. So GHC does not work OK in this situation :-(
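The failure described above can be reproduced with a toy model written in Python. This is not GHC's code: the handler name `toy_private_use` and the functions `toy_decode` / `toy_encode` are invented for illustration, and the only assumption taken from the message is that an undecodable byte 0xNN is escaped as the codepoint 0xEFNN.

```python
import codecs

def _escape_to_private_use(err):
    # Toy handler: map each undecodable byte 0xNN to the codepoint 0xEFNN.
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        return "".join(chr(0xEF00 + b) for b in bad), err.end
    raise err

codecs.register_error("toy_private_use", _escape_to_private_use)

def toy_decode(raw: bytes) -> str:
    return raw.decode("utf-8", errors="toy_private_use")

def toy_encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEF00 <= cp <= 0xEFFF:
            # An escape and a genuine 0xEFNN character look identical here.
            out.append(cp & 0xFF)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

# An invalid byte roundtrips as intended.
invalid = b"a\xffb"
invalid_ok = toy_encode(toy_decode(invalid)) == invalid

# A filename that genuinely contains U+EF80 does not roundtrip.
genuine = "a\uef80b".encode("utf-8")
broken = toy_encode(toy_decode(genuine))

print(invalid_ok, genuine, broken)
```

In this model the valid three-byte encoding of U+EF80 comes back as the single byte 0x80, which is the mismatch the paragraph describes.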
Are you saying there's a bug that should be fixed?

Thanks
Ian