
Marcin 'Qrczak' Kowalczyk wrote:
Encouraged by Mono, for my language Kogut I adopted a hack that Unicode people hate: the possibility of using a modified UTF-8 variant in which byte sequences that are illegal in UTF-8 are decoded to U+0000 followed by another character.
I don't like the idea of using U+0000, because it looks like an ASCII control character and in any case has a long tradition of being used for something else. Why not use a code point that can't result from decoding a valid UTF-8 string? U+FFFF (EF BF BF) is illegal in UTF-8, for example, and I don't think it's legal UTF-16 either. This would give you round-tripping for all legal UTF-8 and UTF-16 strings.

Or you could use the values U+DC00 through U+DFFF (unpaired surrogates), which definitely aren't legal UTF-8 or UTF-16. There's plenty of room there to encode each invalid UTF-8 byte in a single word, instead of a sequence of two words.

A much cleaner solution would be to reserve part of the private use area, say U+109780 through U+1097FF (DBE5 DF80 through DBE5 DFFF in UTF-16). There's a pretty good chance you won't collide with anyone. It's too bad Unicode hasn't set aside 128 code points for this purpose. Maybe we should grab some unassigned code points, document them, and hope the convention catches on.

There's a lot to be said for any encoding, however nasty, that at least takes ASCII to ASCII. Often people just want to inspect the ASCII portions of a string while leaving the rest untouched (e.g. when parsing "--output-file=¡£ª±ïñ¹!.txt"), and any encoding that permits this is good enough.
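To make the surrogate variant concrete, here is a minimal Haskell sketch (the names decodeEscaped and encodeEscaped are invented for this example, not from any library): every byte that can't be decoded as UTF-8 comes out as U+DC00 plus the byte's value, and the encoder maps that range back to raw bytes, so arbitrary byte strings round-trip.

import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- True for UTF-8 continuation bytes (10xxxxxx).
cont :: Word8 -> Bool
cont b = b .&. 0xC0 == 0x80

-- Decode UTF-8; any byte that does not start a well-formed sequence
-- is mapped to U+DC00 + byte, a lone surrogate that valid UTF-8 can
-- never produce.
decodeEscaped :: [Word8] -> String
decodeEscaped [] = []
decodeEscaped bs@(b : rest) =
  case utf8Char bs of
    Just (c, rest') -> c : decodeEscaped rest'
    Nothing         -> chr (0xDC00 + fromIntegral b) : decodeEscaped rest

-- One well-formed sequence: shortest form only, no surrogates,
-- nothing above U+10FFFF.
utf8Char :: [Word8] -> Maybe (Char, [Word8])
utf8Char (b0 : rest)
  | b0 < 0x80 = Just (chr (fromIntegral b0), rest)
utf8Char (b0 : b1 : rest)
  | b0 .&. 0xE0 == 0xC0 && cont b1 && v >= 0x80 = Just (chr v, rest)
  where
    v = fromIntegral (b0 .&. 0x1F) `shiftL` 6
        .|. fromIntegral (b1 .&. 0x3F)
utf8Char (b0 : b1 : b2 : rest)
  | b0 .&. 0xF0 == 0xE0 && cont b1 && cont b2 && v >= 0x800
      && (v < 0xD800 || v > 0xDFFF) = Just (chr v, rest)
  where
    v = fromIntegral (b0 .&. 0x0F) `shiftL` 12
        .|. fromIntegral (b1 .&. 0x3F) `shiftL` 6
        .|. fromIntegral (b2 .&. 0x3F)
utf8Char (b0 : b1 : b2 : b3 : rest)
  | b0 .&. 0xF8 == 0xF0 && cont b1 && cont b2 && cont b3
      && v >= 0x10000 && v <= 0x10FFFF = Just (chr v, rest)
  where
    v = fromIntegral (b0 .&. 0x07) `shiftL` 18
        .|. fromIntegral (b1 .&. 0x3F) `shiftL` 12
        .|. fromIntegral (b2 .&. 0x3F) `shiftL` 6
        .|. fromIntegral (b3 .&. 0x3F)
utf8Char _ = Nothing

-- Re-encode to UTF-8, turning the escape range back into raw bytes.
encodeEscaped :: String -> [Word8]
encodeEscaped = concatMap enc
  where
    enc c
      | n >= 0xDC00 && n <= 0xDCFF = [fromIntegral (n - 0xDC00)]
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. hi (n `shiftR` 6), lo n]
      | n < 0x10000 = [0xE0 .|. hi (n `shiftR` 12), lo (n `shiftR` 6), lo n]
      | otherwise   = [0xF0 .|. hi (n `shiftR` 18), lo (n `shiftR` 12),
                       lo (n `shiftR` 6), lo n]
      where n = ord c
    hi, lo :: Int -> Word8
    hi = fromIntegral
    lo n = 0x80 .|. fromIntegral (n .&. 0x3F)

For instance, decodeEscaped [0x41, 0xFF, 0xC3, 0xA9] yields "A\56575\233" (the stray 0xFF escaped as U+DCFF between 'A' and 'é'), and encodeEscaped restores the original four bytes.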
The alternatives were:
* Use byte strings in some places and character strings in others, with the choice of type sometimes depending on the OS (Windows filenames would be character strings).
* Fail when encountering byte strings that can't be decoded.
Another alternative is to simulate the existence of a UTF-8 locale on Win32: represent filenames as byte strings on both platforms; on NT, convert between UTF-8 and UTF-16 when interfacing with the outside world; on 9x, either use the ANSI/OEM encoding internally or convert between UTF-8 and the ANSI/OEM encoding.

NT probably doesn't check that the filenames you pass to the kernel are valid UTF-16, so there's some possibility that files with ill-formed names would be accessible to other applications but not to Haskell applications. But I imagine such files are much rarer than Unix filenames that aren't legal in the current locale, and you could still fall back on the private-encoding trick for them.

-- Ben
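As a rough sketch of that boundary conversion, reusing decodeEscaped and encodeEscaped from the earlier sketch (toUtf16, fromUtf16 and the name* helpers are invented names, not a real Win32 binding): filenames stay UTF-8 byte strings inside the program and become UTF-16 code units only at the NT system-call boundary.

import Data.Char (chr, ord)
import Data.Word (Word16, Word8)

-- Encode a String as UTF-16 code units for an NT "W" entry point.
-- Escape characters in U+DC00..U+DCFF go out as lone surrogates,
-- which NT is assumed (as above) not to reject.
toUtf16 :: String -> [Word16]
toUtf16 = concatMap enc
  where
    enc c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [ fromIntegral (0xD800 + (m `div` 0x400))
                      , fromIntegral (0xDC00 + (m `mod` 0x400)) ]
      where
        n = ord c
        m = n - 0x10000

-- Decode UTF-16 coming back from the OS. An unpaired surrogate is
-- kept as a lone Char rather than rejected.
fromUtf16 :: [Word16] -> String
fromUtf16 [] = []
fromUtf16 (w : ws)
  | w >= 0xD800, w < 0xDC00
  , (w2 : ws') <- ws
  , w2 >= 0xDC00, w2 < 0xE000
  = chr (0x10000 + fromIntegral (w - 0xD800) * 0x400
                 + fromIntegral (w2 - 0xDC00)) : fromUtf16 ws'
  | otherwise = chr (fromIntegral w) : fromUtf16 ws

-- Program-side byte-string filename -> UTF-16 for the NT call, and back.
nameForNT :: [Word8] -> [Word16]
nameForNT = toUtf16 . decodeEscaped

nameFromNT :: [Word16] -> [Word8]
nameFromNT = encodeEscaped . fromUtf16

Escape characters travel to NT as lone surrogates, which is harmless if the kernel really doesn't validate; conversely, a lone surrogate arriving from NT in the U+DC00..U+DCFF range would collide with the byte escapes, which is exactly where the private-encoding trick mentioned above would take over.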