
Ben Rudiak-Gould
I don't like the idea of using U+0000, because it looks like an ASCII control character, and in any case has a long tradition of being used for something else. Why not use a code point that can't result from decoding a valid UTF-8 string? U+FFFF (EF BF BF) is illegal in UTF-8,
It is legal. It's meaningless for data exchange, but OSes don't prevent creating a file with UTF-8-encoded U+FFFF in its name, and a true UTF-8 decoder interprets that byte sequence as U+FFFF. U+0000 and surrogates are the only code points which can't appear in true UTF-8-encoded filenames, and thus using them is necessary to be fully compatible with true UTF-8.
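This distinction can be checked with any strict UTF-8 codec; here is an illustrative sketch using Python's 'utf-8' codec as a stand-in for a "true" decoder (the snippet is mine, not from the original exchange):

```python
# EF BF BF decodes to U+FFFF: legal UTF-8, if meaningless for interchange.
assert b'\xef\xbf\xbf'.decode('utf-8') == '\uffff'

# A lone surrogate (ED B0 80 would be U+DC00) is rejected by a strict decoder...
try:
    b'\xed\xb0\x80'.decode('utf-8')
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected

# ...and U+0000 encodes to the byte 00, which cannot appear in a filename,
# so neither U+0000 nor surrogates can come out of decoding a filename.
assert '\x00'.encode('utf-8') == b'\x00'
```

So surrogates and U+0000 are exactly the code points a true UTF-8 decoder can never produce from a filename, which is the compatibility argument being made.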
Or you could use values from U+DC00 to U+DFFF,
Right, but somehow I like U+0000 more.
A much cleaner solution would be to reserve part of the private use area, say U+109780 through U+1097FF (DBE5 DF80 through DBE5 DFFF).
This would not be fully compatible with true UTF-8, because these characters already have a representation in UTF-8.
There's a lot to be said for any encoding, however nasty, that at least takes ASCII to ASCII.
Right, but '\0' can't appear in filenames.

My conversion routines for strings exchanged with C assume that the default encoding leaves ASCII, except NUL, unchanged. NUL has to be special-cased anyway, because in most cases it's disallowed in a C string. So the fast path checks whether all characters are in U+0001..U+007F; if so, the string is used directly by C (my representation of strings uses one byte per character, with '\0' at the end, if the string has no characters above U+00FF). Otherwise the string is encoded using the dynamically specified default encoding, with an additional check that the *resulting* string contains no '\0'; if it does, that is an error.

Conversion of file contents takes no such shortcuts and assumes nothing about ASCII compatibility: it always works on buffers of 4-byte characters.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
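[Editor's sketch of the fast path described above, with assumed semantics rather than the original Kogut implementation:]

```python
def to_c_string(s, encoding='utf-8'):
    """Convert a string for exchange with C, per the scheme described above."""
    # Fast path: every character in U+0001..U+007F, so the bytes can be
    # used directly as a NUL-terminated C string.
    if s and all(0x01 <= ord(c) <= 0x7F for c in s):
        return s.encode('ascii')
    # Slow path: apply the (dynamically specified) default encoding, then
    # check that the *resulting* byte string contains no NUL.
    encoded = s.encode(encoding)
    if b'\x00' in encoded:
        raise ValueError("encoded string contains NUL")
    return encoded

assert to_c_string('hello') == b'hello'            # fast path
assert to_c_string('żółw') == 'żółw'.encode('utf-8')  # slow path
```

Note how NUL is caught in both directions: a '\0' character fails the fast-path range check, and a NUL byte produced by the encoder is rejected afterwards.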