Re: behaviour change in getDirectoryContents in GHC 7.2?

2 Nov 2011


On Wed, Nov 02, 2011 at 07:59:21PM +0000, Max Bolingbroke wrote:
...
On 2 November 2011 19:13, Ian Lynagh wrote:
...
They are allowed to occur in Linux/ext2 filenames, anyway, and I think
we ought to be able to handle them correctly if they do.

In Python, if a filename is decoded using UTF-8 and the "surrogate
escape" error handler, occurrences of lone surrogates are a decoding
error because they are not allowed to occur in UTF-8 text. As a result
the lone surrogate is put into the string escaped so it can be
roundtripped back to a lone surrogate on output. So Python works OK.
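A minimal sketch of the Python behaviour described above, assuming Python 3 and its built-in "surrogateescape" error handler. The filename bytes are made up for illustration: they contain ED B2 80, the UTF-8-style encoding of the lone surrogate U+DC80, which a strict UTF-8 decoder rejects.

```python
# Hypothetical filename bytes containing an encoded lone surrogate (ED B2 80).
raw = b"file-\xed\xb2\x80.txt"

# Strict UTF-8 refuses these bytes; surrogateescape instead maps each
# undecodable byte 0xNN to the lone surrogate U+DCNN in the decoded string.
decoded = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(c)) for c in decoded if 0xDC80 <= ord(c) <= 0xDCFF]

# Encoding with the same handler turns each U+DCNN back into byte 0xNN,
# so the original byte sequence is recovered.
roundtripped = decoded.encode("utf-8", errors="surrogateescape")
```

Here the three offending bytes are escaped one by one, and `roundtripped` compares equal to `raw`.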
In GHC >= 7.2, if a filename is decoded using UTF-8 and the "Roundtrip"
error handler, occurrences of 0xEFNN are not a decoding error because
they are perfectly fine Unicode codepoints. As a result they get put
into the string unescaped, and so when we try to roundtrip the string
we get the byte 0xNN in the output rather than the UTF-8 encoding of
0xEFNN. So GHC does not work OK in this situation :-(
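The failure mode above can be modelled in a few lines. This is only a simplified simulation written in Python, not GHC's actual implementation: the handler name `pua_escape_model` and the functions `decode_model` and `encode_model` are hypothetical, and the model assumes only what is stated above (an undecodable byte 0xNN is escaped as the code point 0xEFNN, and 0xEFNN is written back as the byte 0xNN).

```python
import codecs

def _pua_escape(err):
    # Model of the escape step: each undecodable byte 0xNN becomes U+EFNN.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(0xEF00 + b) for b in bad), err.end

# Hypothetical handler name, registered only for this sketch.
codecs.register_error("pua_escape_model", _pua_escape)

def decode_model(raw: bytes) -> str:
    return raw.decode("utf-8", errors="pua_escape_model")

def encode_model(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEF00 <= cp <= 0xEFFF:
            out.append(cp - 0xEF00)   # unescape: U+EFNN -> byte 0xNN
        else:
            out += ch.encode("utf-8")
    return bytes(out)

# An undecodable byte survives the round trip...
bad_name = b"caf\xe9"
ok = encode_model(decode_model(bad_name)) == bad_name

# ...but a filename that genuinely contains U+EF80 does not: it decodes
# without error, so on output it is mistaken for an escaped byte.
real_name = "\uef80".encode("utf-8")            # bytes EE BE 80
mangled = encode_model(decode_model(real_name))  # the single byte 80
```

In this model `ok` is true, while `mangled` differs from `real_name`, which is the situation described above. Python avoids the ambiguity because its escape characters (lone surrogates) can never come out of a successful UTF-8 decode.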
Are you saying there's a bug that should be fixed?


Thanks
Ian