
Axel Simon
The solution of representing a file name abstractly is also used by the Java libraries.
I think it is not. Besides using Java UTF-16 strings for filenames, there is the File class, but it also uses Java strings. The documentation of listFiles() says that each resulting File is made using the File(File, String) constructor. The GNU Java implementation uses a single Java string inside it.

On Windows the OS uses UTF-16 strings natively rather than byte sequences. UTF-16 and Unicode are almost interconvertible (modulo illegal sequences of surrogates), while converting between UTF-16 and byte sequences is messy. This means that unconditionally using Word8 as the representation of filenames would be bad.

I don't know a good solution.

* * *

Encouraged by Mono, for my language Kogut I adopted a hack that Unicode people hate: the possibility to use a modified UTF-8 variant where byte sequences which are illegal in UTF-8 are decoded into U+0000 followed by another character.

This encoding is used as the default encoding instead of the true UTF-8 if the locale says that UTF-8 should be used and a particular environment variable is set (KO_UTF8_ESCAPED_BYTES=1).

The encoding has the following properties:

- Any byte sequence is decodable to a character sequence, which encodes back to the original byte sequence.

- Different character sequences encode to different byte sequences (the U+0000 escape is valid only when it would be necessary).

- It coincides with UTF-8 for valid UTF-8 byte sequences not containing 0x00, and character sequences not containing U+0000.

It's a hack, and it doesn't address encodings other than UTF-8, but it was good enough for me; it lets me maintain the illusion that OS strings are character strings.

Alternatives were:

* Use byte strings and character strings in different places, sometimes using a different type depending on the OS (Windows filenames would be character strings). Disadvantages: it's hard to write a filename to a text file, the API is more complex, and the programmer must too often care about the kind of a string.

* Fail when encountering byte strings which can't be decoded.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
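
To make the three properties of the escaped-bytes encoding concrete, here is a minimal Python sketch of such a codec. It is an illustration under stated assumptions, not Kogut's actual implementation: the post does not say which character follows U+0000, so the sketch assumes the character whose code point equals the offending byte, and it assumes a literal 0x00 byte decodes to two U+0000 characters. The function names decode_escaped and encode_escaped are hypothetical.

```python
def decode_escaped(data: bytes) -> str:
    """Decode bytes as UTF-8, escaping undecodable bytes (assumed scheme)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0:
            # Assumption: a real NUL byte is escaped too, since U+0000
            # is reserved as the escape marker.
            out.append("\x00\x00")
            i += 1
            continue
        # Try to read one well-formed UTF-8 sequence of 1 to 4 bytes.
        for length in (1, 2, 3, 4):
            try:
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(ch)
            i += length
            break
        else:
            # No well-formed sequence starts here: escape this one byte.
            # Assumption: the escape is U+0000 followed by chr(byte).
            out.append("\x00" + chr(b))
            i += 1
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Inverse of decode_escaped; rejects escapes that were not necessary."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\x00":
            if i + 1 >= len(text):
                raise ValueError("dangling U+0000 escape")
            code = ord(text[i + 1])
            if code != 0 and not 0x80 <= code <= 0xFF:
                raise ValueError("invalid character after U+0000")
            out.append(code)
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    result = bytes(out)
    # An escape is valid only where it was necessary: if decoding the
    # result does not give back the input, the input used a needless escape.
    if decode_escaped(result) != text:
        raise ValueError("U+0000 escape used where it was not necessary")
    return result


if __name__ == "__main__":
    name = b"report-\xff.txt"          # not valid UTF-8
    as_text = decode_escaped(name)     # still an ordinary character string
    assert encode_escaped(as_text) == name
```

The final check in encode_escaped is what gives the second property: a string such as U+0000 U+00C3 U+0000 U+00A9 would otherwise encode to the same two bytes as a plain "é", so it is rejected.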