
On 10 November 2011 00:17, Ian Lynagh wrote:
> On Wed, Nov 09, 2011 at 03:58:47PM +0000, Max Bolingbroke wrote:
>> (Note that the above outlined problems are problems in the current implementation too)
> Then the proposal seems to me to be strictly better than the current system. Under both systems the wrong thing happens when U+EFxx is entered as Unicode text, but the proposed system works for all filenames read from the filesystem.
Your proposal is not *strictly* better than what is implemented, in at least the following ways:

1. With your proposal, if you read a filename containing U+EF80 into the variable "fp" and then expect the character U+EF80 to be in fp, you will be surprised to find only its escaped form. In the current implementation you will in fact find U+EF80.

2. The performance of iconv-based decoders will suffer, because we will need to do a post-pass in the TextEncoding to do this extra escaping for U+EFxx characters.

I'm really not keen on implementing a fix that addresses such a limited subset of the problems, anyway.
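To make point 1 concrete, here is a minimal sketch of the PEP-383-style escaping into the U+EFxx range that the proposal describes. This is not GHC's actual code; the names escapeByte/unescapeChar are mine:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Map an undecodable byte into the private-use range U+EF00..U+EFFF.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xEF00 + fromIntegral b)

-- Recover the original byte from an escape character, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | 0xEF00 <= n && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
  | otherwise                  = Nothing
  where n = ord c

main :: IO ()
main = do
  -- A raw byte 0x80 that failed to decode round-trips via the escape:
  print (unescapeChar (escapeByte 0x80))  -- Just 128
  -- But a genuine U+EF80 entered as text is indistinguishable from the
  -- escape for byte 0x80, which is exactly the surprise in point 1:
  print (unescapeChar '\xEF80')           -- Just 128
```

So any consumer that unescapes will turn a legitimately typed U+EF80 back into a raw byte, and conversely a program looking for the literal character will only see the escaped form.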
> In the longer term, I think we need to fix the underlying problem that (for example) both getLine and getArgs produce a String from bytes, but do so in different ways. At some point we should change the type of getArgs and friends.
I'm not sure about this. hGetLine produces a String from bytes in a different way depending on the encoding set on the Handle, but we don't try to differentiate in the type system between Strings decoded using different TextEncodings. Why should getLine and getArgs be different?

If you are really unhappy about getLine and getArgs having different behaviour in this sense, one option would be to change the default stdout/stdin TextEncoding to use the fileSystemEncoding that knows about escapes. (Note that this would mean that your Haskell program wouldn't immediately die if you were using the UTF-8 locale and then tried to read some non-UTF-8 input from stdin, which might or might not be a good thing, depending on the application.)

Max
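P.S. The stdin/stdout switch described above could look roughly like this, assuming the filesystem encoding is exposed as GHC.IO.Encoding.getFileSystemEncoding (as in recent base); a sketch, not a definitive recommendation:

```haskell
import GHC.IO.Encoding (getFileSystemEncoding)
import System.IO (hSetEncoding, stdin, stdout)

main :: IO ()
main = do
  enc <- getFileSystemEncoding
  -- Make stdin/stdout decode and encode bytes the same way getArgs and
  -- the filesystem functions do, escapes and all:
  hSetEncoding stdin enc
  hSetEncoding stdout enc
  -- TextEncoding's Show instance is its name, e.g. "UTF-8":
  print enc
```

With this in place, non-UTF-8 bytes arriving on stdin would be escaped rather than causing an immediate decoding error, matching the caveat above.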