
On 9 November 2011 13:11, Ian Lynagh wrote:
> If we aren't going to guarantee that the encoded string is unicode, then is there any benefit to encoding it in the first place?

(I think you mean decoded here - my understanding is that decode :: ByteString -> String, encode :: String -> ByteString)

> Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?
> (Max gave some reasons earlier in this thread, but I'd need examples of what goes wrong to understand them).
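(For concreteness, the scheme being proposed can be sketched as a pair of pure functions -- a hypothetical model for illustration, not GHC's actual TextEncoding machinery:)

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical model of the proposal: every byte b of the on-disk
-- name is escaped to the private-use char U+EF00 + b.
escapeBytes :: [Word8] -> String
escapeBytes = map (\b -> chr (0xEF00 + fromIntegral b))

-- The inverse: only chars in U+EF00..U+EFFF decode back to a byte.
unescapeBytes :: String -> Maybe [Word8]
unescapeBytes = traverse toByte
  where
    toByte c
      | ord c >= 0xEF00 && ord c <= 0xEFFF = Just (fromIntegral (ord c - 0xEF00))
      | otherwise                          = Nothing
```

Under this model the UTF-8 bytes 0xEE 0xBC 0x80 (U+EF00) become the Char sequence U+EFEE U+EFBC U+EF80, and any byte sequence roundtrips through escapeBytes/unescapeBytes.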
We can do this but it doesn't solve all problems. Here are three such problems:

PROBLEM 1 (bleeding from non-escaping to escaping TextEncodings)
===

So let's say we are reading a filename from stdin. Currently stdin uses the utf8 TextEncoding -- this TextEncoding knows nothing about private-char roundtripping, and will throw an exception when decoding bad bytes or encoding our private chars.

Now the user types a UTF-8 U+EF80 character - i.e. we get the bytes 0xEE 0xBE 0x80 on stdin. The utf8 TextEncoding naively decodes this byte sequence to the character sequence U+EF80. We have lost at this point: if the user supplies the resulting String to a function that encodes the String with the fileSystemEncoding, the String will be encoded into the byte sequence 0x80. This is probably not what we want to happen!

It means that a program like this:

"""
main = do
  fp <- getLine
  readFile fp >>= putStrLn
"""

will fail ("file not found: \x80") when given the name of an existing file 0xEE 0xBE 0x80.

PROBLEM 2 (bleeding between two different escaping TextEncodings)
===

So let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80.

What happens when we then *encode* that Char sequence using a UTF-16 TextEncoding (that knows about the 0xEFxx escape mechanism)? The resulting byte sequence is 0xEE 0xBC 0x80, NOT the UTF-16 encoded version of U+EF00! This is certainly contrary to what the user would expect.

PROBLEM 3 (bleeding from escaping to non-escaping TextEncodings)
===

Just as above, let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80.
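The PROBLEM 2 bleed can be reproduced with a pure model of an escape-aware encoder (utf16be here is a hypothetical, BMP-only simplification of a real UTF-16 TextEncoding; escapeAware wraps any per-char encoder with the proposed 0xEFxx escape rule):

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical, BMP-only UTF-16BE encoder (no surrogate pairs).
utf16be :: Char -> [Word8]
utf16be c = [fromIntegral (n `shiftR` 8), fromIntegral (n .&. 0xFF)]
  where n = ord c

-- Wrap a per-char encoder with the proposed escape rule: the private
-- char U+EF00+b stands for the raw byte b, so it is emitted as-is
-- instead of being encoded normally.
escapeAware :: (Char -> [Word8]) -> String -> [Word8]
escapeAware enc = concatMap step
  where
    step c
      | ord c >= 0xEF00 && ord c <= 0xEFFF = [fromIntegral (ord c - 0xEF00)]
      | otherwise                          = enc c
```

Encoding the escaped Char sequence gives escapeAware utf16be "\xEFEE\xEFBC\xEF80" = [0xEE, 0xBC, 0x80] -- the original UTF-8 bytes -- whereas the user presumably expected the UTF-16 encoding of U+EF00, i.e. utf16be '\xEF00' = [0xEF, 0x00].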
If you try to write this String to stdout (which uses the UTF-8 encoding that knows nothing about 0xEFxx escapes) you just get an exception, NOT the UTF-8 encoded version of U+EF00. Game over man, game over!

CONCLUSION
===

As far as I can see, the proposed escaping scheme recovers the roundtrip property but fails to regain a lot of other reasonable-looking behaviours.

(Note that the above outlined problems are problems in the current implementation too -- but the current implementation doesn't even pretend to support U+EFxx characters. Its correctness is entirely dependent on them never showing up, which is why we chose a part of the private codepoint region that is reserved specifically for the purpose of encoding hacks).

Max