
On 9 November 2011 13:11, Ian Lynagh wrote:
> If we aren't going to guarantee that the encoded string is unicode, then is there any benefit to encoding it in the first place?

(I think you mean decoded here - my understanding is that decode :: ByteString -> String, encode :: String -> ByteString)

> Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?
> (Max gave some reasons earlier in this thread, but I'd need examples of what goes wrong to understand them).
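(For concreteness, the scheme being proposed can be sketched as a pair of pure functions -- a hypothetical model for illustration, not GHC's actual TextEncoding machinery:)

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical model of the proposal: every byte b of the on-disk
-- name is escaped to the private-use char U+EF00 + b.
escapeBytes :: [Word8] -> String
escapeBytes = map (\b -> chr (0xEF00 + fromIntegral b))

-- The inverse: only chars in U+EF00..U+EFFF decode back to a byte.
unescapeBytes :: String -> Maybe [Word8]
unescapeBytes = traverse toByte
  where
    toByte c
      | ord c >= 0xEF00 && ord c <= 0xEFFF = Just (fromIntegral (ord c - 0xEF00))
      | otherwise                          = Nothing
```

Under this model the UTF-8 bytes 0xEE 0xBC 0x80 (U+EF00) become the Char sequence U+EFEE U+EFBC U+EF80, and any byte sequence roundtrips through escapeBytes/unescapeBytes.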
We can do this but it doesn't solve all problems. Here are three such problems:

PROBLEM 1 (bleeding from non-escaping to escaping TextEncodings)
===

So let's say we are reading a filename from stdin. Currently stdin uses the utf8 TextEncoding -- this TextEncoding knows nothing about private-char roundtripping, and will throw an exception when decoding bad bytes or encoding our private chars.

Now the user types a UTF-8 U+EF80 character - i.e. we get the bytes 0xEE 0xBE 0x80 on stdin. The utf8 TextEncoding naively decodes this byte sequence to the character sequence U+EF80. We have lost at this point: if the user supplies the resulting String to a function that encodes the String with the fileSystemEncoding, the String will be encoded into the byte sequence 0x80. This is probably not what we want to happen!

It means that a program like this:

"""
main = do
  fp <- getLine
  readFile fp >>= putStrLn
"""

will fail ("file not found: \x80") when given the name of an existing file 0xEE 0xBE 0x80.

PROBLEM 2 (bleeding between two different escaping TextEncodings)
===

So let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80.

What happens when we then *encode* that Char sequence using a UTF-16 TextEncoding (that knows about the 0xEFxx escape mechanism)? The resulting byte sequence is 0xEE 0xBC 0x80, NOT the UTF-16 encoded version of U+EF00! This is certainly contrary to what the user would expect.

PROBLEM 3 (bleeding from escaping to non-escaping TextEncodings)
===

Just as above, let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80.
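The PROBLEM 2 bleed can be reproduced with a pure model of an escape-aware encoder (utf16be here is a hypothetical, BMP-only simplification of a real UTF-16 TextEncoding; escapeAware wraps any per-char encoder with the proposed 0xEFxx escape rule):

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical, BMP-only UTF-16BE encoder (no surrogate pairs).
utf16be :: Char -> [Word8]
utf16be c = [fromIntegral (n `shiftR` 8), fromIntegral (n .&. 0xFF)]
  where n = ord c

-- Wrap a per-char encoder with the proposed escape rule: the private
-- char U+EF00+b stands for the raw byte b, so it is emitted as-is
-- instead of being encoded normally.
escapeAware :: (Char -> [Word8]) -> String -> [Word8]
escapeAware enc = concatMap step
  where
    step c
      | ord c >= 0xEF00 && ord c <= 0xEFFF = [fromIntegral (ord c - 0xEF00)]
      | otherwise                          = enc c
```

Encoding the escaped Char sequence gives escapeAware utf16be "\xEFEE\xEFBC\xEF80" = [0xEE, 0xBC, 0x80] -- the original UTF-8 bytes -- whereas the user presumably expected the UTF-16 encoding of U+EF00, i.e. utf16be '\xEF00' = [0xEF, 0x00].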
If you try to write this String to stdout (which uses the UTF-8 encoding that knows nothing about 0xEFxx escapes) you just get an exception, NOT the UTF-8 encoded version of U+EF00. Game over man, game over!

CONCLUSION
===

As far as I can see, the proposed escaping scheme recovers the roundtrip property but fails to regain a lot of other reasonable-looking behaviours.

(Note that the above outlined problems are problems in the current implementation too -- but the current implementation doesn't even pretend to support U+EFxx characters. Its correctness is entirely dependent on them never showing up, which is why we chose a part of the private codepoint region that is reserved specifically for the purpose of encoding hacks).

Max