
Ben Rudiak-Gould
I don't like the idea of using U+0000, because it looks like an ASCII control character, and in any case has a long tradition of being used for something else. Why not use a code point that can't result from decoding a valid UTF-8 string? U+FFFF (EF BF BF) is illegal in UTF-8,
It is legal. It's meaningless for data exchange, but OSes don't prevent creating a file with UTF-8-encoded U+FFFF in its name, and a true UTF-8 decoder interprets that byte sequence as U+FFFF. U+0000 and surrogates are the only code points which can't appear in true UTF-8-encoded filenames, and thus using them is necessary to be fully compatible with true UTF-8.
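This distinction can be checked with any strict UTF-8 codec; here is an illustrative sketch using Python's 'utf-8' codec as a stand-in for a "true" decoder (the snippet is mine, not from the original exchange):

```python
# EF BF BF decodes to U+FFFF: legal UTF-8, if meaningless for interchange.
assert b'\xef\xbf\xbf'.decode('utf-8') == '\uffff'

# A lone surrogate (ED B0 80 would be U+DC00) is rejected by a strict decoder...
try:
    b'\xed\xb0\x80'.decode('utf-8')
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected

# ...and U+0000 encodes to the byte 00, which cannot appear in a filename,
# so neither U+0000 nor surrogates can come out of decoding a filename.
assert '\x00'.encode('utf-8') == b'\x00'
```

So surrogates and U+0000 are exactly the code points a true UTF-8 decoder can never produce from a filename, which is the compatibility argument being made.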
Or you could use values from U+DC00 to U+DFFF,
Right, but somehow I like U+0000 more.
A much cleaner solution would be to reserve part of the private use area, say U+109780 through U+1097FF (DBE5 DF80 through DBE5 DFFF).
This would not be fully compatible with true UTF-8, because these characters already have a representation in UTF-8.
There's a lot to be said for any encoding, however nasty, that at least takes ASCII to ASCII.
Right, but '\0' can't appear in filenames.

My conversion routines for strings exchanged with C assume that the default encoding leaves ASCII, except NUL, unchanged. NUL has to be special-cased anyway, because in most cases it's disallowed in a C string. So the fast path checks whether all characters are in U+0001..U+007F; if so, the string is used directly by C (my representation of strings uses one byte per character, with '\0' at the end, if the string has no characters above U+00FF). Otherwise the string is encoded using the dynamically specified default encoding, with an additional check that the *resulting* string contains no '\0'; if it does, that is an error.

Conversion of file contents takes no such shortcuts and assumes nothing about ASCII compatibility: it always works on buffers of 4-byte characters.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
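[Editor's sketch of the fast path described above, with assumed semantics rather than the original Kogut implementation:]

```python
def to_c_string(s, encoding='utf-8'):
    """Convert a string for exchange with C, per the scheme described above."""
    # Fast path: every character in U+0001..U+007F, so the bytes can be
    # used directly as a NUL-terminated C string.
    if s and all(0x01 <= ord(c) <= 0x7F for c in s):
        return s.encode('ascii')
    # Slow path: apply the (dynamically specified) default encoding, then
    # check that the *resulting* byte string contains no NUL.
    encoded = s.encode(encoding)
    if b'\x00' in encoded:
        raise ValueError("encoded string contains NUL")
    return encoded

assert to_c_string('hello') == b'hello'            # fast path
assert to_c_string('żółw') == 'żółw'.encode('utf-8')  # slow path
```

Note how NUL is caught in both directions: a '\0' character fails the fast-path range check, and a NUL byte produced by the encoder is rejected afterwards.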