
I think this has not yet been discussed on the wiki.
From a recent post to the Haskell list:
-------- Forwarded Message --------
From: Krasimir Angelov

On Fri, 2006-02-03 at 12:13 +0000, Ross Paterson wrote:
On Fri, Feb 03, 2006 at 12:06:28PM +0000, Axel Simon wrote:
I think this is not yet discussed on the wiki:
[FilePath as String or ADT]
The issue (and the related one with program arguments and environment variables) is mentioned under CharAsUnicode.
Yes, and I suppose not being opaque about a file name (i.e. FilePath = [Word8]) is superior. My fault. So why is the whole Unicode proposal under "adopt: none"? Did nobody look at that yet? Axel.

Ross Paterson wrote:
On Fri, Feb 03, 2006 at 12:24:28PM +0000, Axel Simon wrote:
Yes, and I suppose not being opaque about a file name (i.e. FilePath = [Word8]) is superior.
Maybe. You might want [Word8] under Unix and [Word16] under Win32.
So why is the whole Unicode proposal under "adopt: none"? Did nobody look at that yet?
No, there's no formal proposal yet. It would probably be two or three proposals (source, I/O, strings).

On 2006-02-03, Ross Paterson wrote:
On Fri, Feb 03, 2006 at 12:24:28PM +0000, Axel Simon wrote:
Yes, and I suppose not being opaque about a file name (i.e. FilePath = [Word8]) is superior.
Maybe. You might want [Word8] under Unix and [Word16] under Win32.
Right. I think "Generic File Handling" should not be considered the base, but should be layered on top of Unix, Win32, and possibly MacOS, if the Unix layer doesn't cover that. -- Aaron Denney -><-
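As an illustration of the representation question above (a sketch only, with invented names, not an actual proposal), a platform-dependent opaque path type might look like this in Haskell, using [Word16] on Win32 and [Word8] elsewhere:

    {-# LANGUAGE CPP #-}
    -- Hypothetical sketch: the constructor is not exported, so the
    -- representation stays abstract and generic file handling can be
    -- layered on top of the per-platform versions.
    module OSPath (OSPath) where

    import Data.Word (Word8, Word16)

    #if defined(mingw32_HOST_OS)
    -- Win32 system calls traffic in UTF-16 code units.
    newtype OSPath = OSPath [Word16] deriving (Eq, Ord, Show)
    #else
    -- POSIX system calls traffic in raw bytes.
    newtype OSPath = OSPath [Word8] deriving (Eq, Ord, Show)
    #endif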

Axel Simon wrote:
The solution of representing a file name abstractly is also used by the Java libraries.
I think it is not. Besides using Java UTF-16 strings for filenames, there is the File class, but it also uses Java strings: the documentation of listFiles() says that each resulting File is made using the File(File, String) constructor, and the GNU Java implementation stores a single Java string inside it.

On Windows the OS uses UTF-16 strings natively rather than byte sequences. UTF-16 and Unicode are almost interconvertible (modulo illegal sequences of surrogates), while converting between UTF-16 and byte sequences is messy. This means that unconditionally using Word8 as the representation of filenames would be bad. I don't know a good solution.

* * *

Encouraged by Mono, for my language Kogut I adopted a hack that Unicode people hate: the possibility of using a modified UTF-8 variant in which byte sequences that are illegal in UTF-8 are decoded into U+0000 followed by another character. This encoding is used as the default encoding instead of true UTF-8 if the locale says that UTF-8 should be used and a particular environment variable is set (KO_UTF8_ESCAPED_BYTES=1). The encoding has the following properties:

- Any byte sequence is decodable to a character sequence, which encodes back to the original byte sequence.
- Different character sequences encode to different byte sequences (the U+0000 escape is valid only where it is necessary).
- It coincides with UTF-8 for valid UTF-8 byte sequences not containing 0x00, and for character sequences not containing U+0000.

It's a hack, and it doesn't address encodings other than UTF-8, but it was good enough for me; it allows maintaining the illusion that OS strings are character strings.

Alternatives were:

* Use byte strings and character strings in different places, sometimes using a different type depending on the OS (Windows filenames would be character strings). Disadvantages: it's hard to write a filename to a text file, the API is more complex, and the programmer too often has to care about which kind of string is in use.
* Fail when encountering byte strings which can't be decoded.

--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
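As a concrete illustration of the scheme just described, here is a rough sketch (not Kogut's actual code). It assumes the escape is U+0000 followed by the character whose code point equals the offending byte, which the message above does not spell out, and it handles only 2-byte UTF-8 sequences to stay short.

    -- Sketch of "escaped bytes" decoding/encoding. Assumptions: the escape
    -- is U+0000 plus the byte's code point, and multi-byte UTF-8 handling
    -- is limited to 2-byte sequences.
    import Data.Bits ((.&.), (.|.), shiftL, shiftR)
    import Data.Char (chr, ord)
    import Data.Word (Word8)

    decodeEscaped :: [Word8] -> String
    decodeEscaped [] = []
    decodeEscaped bs@(b : rest)
      | b < 0x80  = chr (fromIntegral b) : decodeEscaped rest
      | otherwise = case tryUtf8 bs of
          Just (c, rest') -> c : decodeEscaped rest'
          Nothing         -> '\0' : chr (fromIntegral b) : decodeEscaped rest

    encodeEscaped :: String -> [Word8]
    encodeEscaped [] = []
    encodeEscaped ('\0' : c : cs) = fromIntegral (ord c) : encodeEscaped cs  -- escaped byte (assumed below U+0100)
    encodeEscaped (c : cs)        = utf8Bytes c ++ encodeEscaped cs

    -- Decode one 2-byte UTF-8 sequence (U+0080..U+07FF); anything else fails.
    tryUtf8 :: [Word8] -> Maybe (Char, [Word8])
    tryUtf8 (b0 : b1 : rest)
      | b0 .&. 0xE0 == 0xC0 && b1 .&. 0xC0 == 0x80 && cp >= 0x80 = Just (chr cp, rest)
      where
        cp = (fromIntegral (b0 .&. 0x1F) `shiftL` 6) .|. fromIntegral (b1 .&. 0x3F)
    tryUtf8 _ = Nothing

    -- Encode back to UTF-8; code points above U+07FF are elided in this sketch.
    utf8Bytes :: Char -> [Word8]
    utf8Bytes c
      | n < 0x80  = [fromIntegral n]
      | n < 0x800 = [ 0xC0 .|. fromIntegral (n `shiftR` 6)
                    , 0x80 .|. fromIntegral (n .&. 0x3F) ]
      | otherwise = error "sketch: only code points below U+0800 are encoded"
      where n = ord c

For the cases the sketch covers, encodeEscaped (decodeEscaped bs) == bs, which is the round-trip property listed above.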

Marcin 'Qrczak' Kowalczyk wrote:
Encouraged by Mono, for my language Kogut I adopted a hack that Unicode people hate: the possibility to use a modified UTF-8 variant where byte sequences which are illegal in UTF-8 are decoded into U+0000 followed by another character.
I don't like the idea of using U+0000, because it looks like an ASCII control character, and in any case has a long tradition of being used for something else. Why not use a code point that can't result from decoding a valid UTF-8 string? U+FFFF (EF BF BF) is illegal in UTF-8, for example, and I don't think it's legal UTF-16 either. This would give you round-tripping for all legal UTF-8 and UTF-16 strings.

Or you could use values from U+DC00 to U+DFFF, which definitely aren't legal UTF-8 or UTF-16. There's plenty of room there to encode each invalid UTF-8 byte in a single word, instead of a sequence of two words.

A much cleaner solution would be to reserve part of the private use area, say U+109780 through U+1097FF (DBE5 DF80 through DBE5 DFFF). There's a pretty good chance you won't collide with anyone. It's too bad Unicode hasn't set aside 128 code points for this purpose. Maybe we should grab some unassigned code points, document them, and hope it catches on.

There's a lot to be said for any encoding, however nasty, that at least takes ASCII to ASCII. Often people just want to inspect the ASCII portions of a string while leaving the rest untouched (e.g. when parsing "--output-file=¡£ª±ïñ¹!.txt"), and any encoding that permits this is good enough.
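To make the single-code-unit variant suggested above concrete, a tiny sketch follows; the choice of U+DC00 plus the byte's value as the mapping is made here only for illustration, since the thread doesn't fix the exact sub-range.

    -- Illustrative only: map an undecodable byte into the low-surrogate
    -- range and back. GHC's Char admits surrogate code points, so chr works.
    import Data.Char (chr, ord)
    import Data.Word (Word8)

    escapeByte :: Word8 -> Char
    escapeByte b = chr (0xDC00 + fromIntegral b)

    unescapeChar :: Char -> Maybe Word8
    unescapeChar c
      | n >= 0xDC00 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
      | otherwise                  = Nothing
      where n = ord c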
Alternatives were:
* Use byte strings and character strings in different places, sometimes using a different type depending on the OS (Windows filenames would be character strings).
* Fail when encountering byte strings which can't be decoded.
Another alternative is to simulate the existence of a UTF-8 locale on Win32. Represent filenames as byte strings on both platforms; on NT, convert between UTF-8 and UTF-16 when interfacing with the outside; on 9x, either use the ANSI/OEM encoding internally or convert between UTF-8 and the ANSI/OEM encoding.

I suppose NT probably doesn't check that the filenames you pass to the kernel are valid UTF-16, so there's some possibility that files with illegal names might be accessible to other applications but not to Haskell applications. But I imagine such files are much rarer than Unix filenames that aren't legal in the current locale. And you could still use the private-encoding trick if not.

-- Ben
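One ingredient of the NT-side conversion described above, shown as a generic sketch (plain Unicode arithmetic, not a Win32 binding): turning a decoded character into UTF-16 code units, with a surrogate pair for code points above U+FFFF.

    import Data.Bits (shiftR, (.&.))
    import Data.Char (ord)
    import Data.Word (Word16)

    -- Encode one Char as UTF-16 code units.
    charToUtf16 :: Char -> [Word16]
    charToUtf16 c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                      , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
      where
        n = ord c
        m = n - 0x10000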

Ben Rudiak-Gould wrote:
I don't like the idea of using U+0000, because it looks like an ASCII control character, and in any case has a long tradition of being used for something else. Why not use a code point that can't result from decoding a valid UTF-8 string? U+FFFF (EF BF BF) is illegal in UTF-8,
It is legal. It's meaningless for data exchange, but OSes don't prevent creating a file with UTF-8-encoded U+FFFF in its name, and a true UTF-8 decoder interprets that byte sequence as U+FFFF. U+0000 and surrogates are the only code points which can't appear in true UTF-8-encoded filenames, and thus using them is necessary to be fully compatible with true UTF-8.
Or you could use values from U+DC00 to U+DFFF,
Right, but somehow I like U+0000 more.
A much cleaner solution would be to reserve part of the private use area, say U+109780 through U+1097FF (DBE5 DF80 through DBE5 DFFF).
This would not be fully compatible with true UTF-8, because these characters already have a representation in UTF-8.
There's a lot to be said for any encoding, however nasty, that at least takes ASCII to ASCII.
Right, but '\0' can't appear in filenames. My conversion routines for strings exchanged with C assume that the default encoding leaves ASCII, apart from NUL, unchanged. NUL has to be special-cased anyway because in most cases it's disallowed in a C string.

So the fast path checks whether all characters are in U+0001..U+007F, and if so, the string is used directly by C (my representation of strings uses one byte per character, with '\0' at the end, if the string has no characters above U+00FF). Otherwise it's encoded using the dynamically specified default encoding, and there is an additional check that the *resulting* string contains no '\0'; if it does, that's an error.

Conversion of file contents doesn't take shortcuts and doesn't assume anything about ASCII compatibility; it always works on buffers containing 4-byte characters.

--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
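A small sketch of the fast-path test just described, with invented names; the actual encoder and the C side are elided.

    import Data.Word (Word8)

    -- Fast path: every character is in U+0001..U+007F, so the string can be
    -- handed to C unchanged (NUL excluded because C strings end at '\0').
    asciiFastPath :: String -> Bool
    asciiFastPath = all (\c -> c > '\0' && c < '\x80')

    -- Otherwise, after encoding with the default encoding, an embedded NUL
    -- byte in the result is an error.
    hasEmbeddedNul :: [Word8] -> Bool
    hasEmbeddedNul = elem 0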
participants (5)
- Aaron Denney
- Axel Simon
- Ben Rudiak-Gould
- Marcin 'Qrczak' Kowalczyk
- Ross Paterson