
Udo Stenzel wrote:
Note that this needs to include all of the core I/O functions, not just reading/writing streams. E.g. FilePath is currently an alias for String, but (on Unix, at least) filenames are strings of bytes, not characters. Ditto for argv, environment variables, possibly other cases which I've overlooked.
I don't think so. They all are sequences of CChars, and C isn't particularly known for keeping bytes and chars apart.
CChar is a C "char", which is a byte (not necessarily an octet, and not necessarily a character either).
I believe Windows NT has (alternate) filename handling functions that use Unicode strings.
Almost all of the Win32 API functions which handle strings exist in both char and wide-char versions.
This would strengthen the view that a filename is a sequence of characters.
It would be reasonable to make FilePath equivalent to String on Windows, but not on Unix.
Ditto for argv, env, whatnot; they are typically entered from the shell and therefore are characters in the local encoding.
Both argv and envp are char**, i.e. lists of byte strings. There is no guarantee that the values can be successfully decoded according to the locale's encoding. The environment is typically set on login, and inherited thereafter. It's typically limited to ASCII, but this isn't guaranteed. Similarly, a program may need to access files which its user didn't create, and whose filenames aren't valid strings according to that user's locale. E.g. a user may choose a locale which uses UTF-8, but the sysadmin has installed files with ISO-8859-1 filenames. If a Haskell program tries to coerce everything to String using the user's locale, the program will be unable to access such files.
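To make the failure concrete, here is a minimal sketch using only the base libraries. The names decodeUtf8 and decodeLatin1 are made up for illustration and are not part of any existing library. It shows that a filename stored as ISO-8859-1 bytes has no valid reading under a strict UTF-8 decoder, whereas ISO-8859-1 accepts any byte sequence:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical helper: ISO-8859-1 maps every byte to the code point
-- with the same number, so decoding can never fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Hypothetical helper: strict UTF-8 decoding, Nothing on any
-- malformed, truncated, overlong or out-of-range sequence.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b : bs)
  | b < 0x80               = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise              = Nothing
  where
    -- n continuation bytes must follow; lo is the smallest code point
    -- that legitimately needs a sequence of this length.
    multi n acc lo =
      let (cont, rest) = splitAt n bs
          cp = foldl (\a c -> (a `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) acc cont
          inRange = cp >= lo && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if length cont == n && all isCont cont && inRange
           then (chr cp :) <$> decodeUtf8 rest
           else Nothing
    isCont c = c .&. 0xC0 == 0x80
```

With these definitions, the bytes [0x63, 0x61, 0x66, 0xE9] ("café" in ISO-8859-1) decode under decodeLatin1 but give Nothing under decodeUtf8, so a program which insists on a String has no name by which to refer to that file.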
3. The default encoding is settable from Haskell, defaults to ISO-8859-1.
Agreed.
Oh no, please don't do that. A global, settable encoding is, well, dys-functional. Hidden state makes programs hard to understand and Haskell imho shouldn't go that route.
There's already plenty of hidden state in the system libraries upon which a Haskell program depends.
And please don't introduce the notion of a "default" encoding.
It isn't an issue of *introducing* it. Many Haskell98 functions (i.e. much of IO, System and Directory) accept or return Strings, yet have to be implemented on top of an OS which accepts or provides "char*"s. There *has* to be an encoding between the two, and currently it's hardwired to ISO-8859-1. The alternative to a global encoding is for *all* functions which interface to the OS to always either accept or return [CChar] or, if they accept or return Strings, accept an additional argument which specifies the encoding. Also, bear in mind that the functions under discussion are all I/O functions which, by their nature, deal with state (e.g. the state of the filesystem).
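A rough sketch of the two options, to show what is being traded off. Everything here (Encoding, currentEncoding, setEncoding, decodeWith, oldStyle) is a hypothetical name invented for this message, not an existing API, and the global variable uses the usual unsafePerformIO idiom:

```haskell
import Data.Char (chr, ord)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical description of an encoding; both directions may fail.
data Encoding = Encoding
  { encName :: String
  , encode  :: String -> Maybe [Word8]
  , decode  :: [Word8] -> Maybe String
  }

latin1 :: Encoding
latin1 = Encoding "ISO-8859-1" enc dec
  where
    enc = traverse (\c -> if ord c < 256 then Just (fromIntegral (ord c)) else Nothing)
    dec = Just . map (chr . fromIntegral)

ascii :: Encoding
ascii = Encoding "US-ASCII" enc dec
  where
    enc = traverse (\c -> if ord c < 128 then Just (fromIntegral (ord c)) else Nothing)
    dec = traverse (\b -> if b < 0x80 then Just (chr (fromIntegral b)) else Nothing)

-- Option 1: every function takes the encoding explicitly.
decodeWith :: Encoding -> [Word8] -> Maybe String
decodeWith = decode

-- Option 2: a settable process-wide encoding, defaulting to ISO-8859-1.
{-# NOINLINE currentEncoding #-}
currentEncoding :: IORef Encoding
currentEncoding = unsafePerformIO (newIORef latin1)

setEncoding :: Encoding -> IO ()
setEncoding = writeIORef currentEncoding

-- A String-returning function in the Haskell98 style, built on top of
-- a byte-level action plus the current encoding.
oldStyle :: IO [Word8] -> IO String
oldStyle getBytes = do
  enc   <- readIORef currentEncoding
  bytes <- getBytes
  case decode enc bytes of
    Just s  -> pure s
    Nothing -> ioError (userError ("cannot decode using " ++ encName enc))
```

The point of the sketch is only that oldStyle has nowhere to get an encoding from other than some form of global state, because its type has no room for an argument.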
I'd like to see the following:
- Duplicate the IO library. The duplicate should work with [Byte] everywhere where the old library uses String. Byte is some suitable unsigned integer; on most (all?) platforms this will be Word8.
Technically it should be CChar. However, it's fairly safe to assume that a byte will always be 8 bits; almost nobody writes code which works on systems where it isn't. If we go this route, though, I suspect that we will also need a convenient method for specifying literal byte strings in Haskell source code.
- Provide an explicit conversion between encodings. A simple conversion of type [Word8] -> String would suit me, iconv would provide all that is needed.
For the general case, you need to allow for stateful encodings (e.g. ISO-2022). Actually, even UTF-8 needs to deal with state if you need to decode byte streams which are split into chunks and the breaks can occur in the middle of a character (e.g. if you're using non-blocking I/O).
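As an illustration of the chunking problem, here is a sketch of a UTF-8 decoder which carries state between calls. Pending, start and feed are names I've made up for the example. It assumes that malformed input should be replaced by U+FFFD rather than reported, which is a policy choice and not the only reasonable one:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Bytes of an incomplete trailing sequence, carried to the next chunk.
newtype Pending = Pending [Word8]
  deriving (Eq, Show)

start :: Pending
start = Pending []

-- For a lead byte: (total sequence length, payload mask of the lead
-- byte, smallest code point which legitimately needs that length).
leadInfo :: Word8 -> Maybe (Int, Int, Int)
leadInfo b
  | b >= 0xC2 && b <= 0xDF = Just (2, 0x1F, 0x80)
  | b >= 0xE0 && b <= 0xEF = Just (3, 0x0F, 0x800)
  | b >= 0xF0 && b <= 0xF4 = Just (4, 0x07, 0x10000)
  | otherwise              = Nothing

isCont :: Word8 -> Bool
isCont c = c .&. 0xC0 == 0x80

-- Decode one chunk, given the bytes left over from the previous one.
feed :: Pending -> [Word8] -> (String, Pending)
feed (Pending old) new = go (old ++ new)
  where
    go [] = ([], Pending [])
    go input@(b : bs)
      | b < 0x80 = emit (chr (fromIntegral b)) bs
      | otherwise = case leadInfo b of
          Nothing -> emit '\xFFFD' bs
          Just (n, mask, lo)
            | length cont == n - 1 && all isCont cont ->
                emit (toChar lo (foldl add (fromIntegral b .&. mask) cont)) rest
            -- The chunk ended mid-character: keep the bytes for later.
            | length cont < n - 1 && all isCont cont -> ([], Pending input)
            | otherwise -> emit '\xFFFD' bs
            where (cont, rest) = splitAt (n - 1) bs
    emit c more = let (cs, p) = go more in (c : cs, p)
    add acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)
    toChar lo cp
      | cp < lo || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) = '\xFFFD'
      | otherwise = chr cp
```

If the two bytes of "é" (0xC3, 0xA9) arrive in separate reads, the first call returns the 0xC3 as Pending and the second call completes the character. A plain [Word8] -> String function has no way to express this.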
- iconv takes names of encodings as arguments. Provide some names as constants: one name for the internal encoding (probably UCS4), one name for the canonical external encoding (probably locale dependent).
- Then redefine the old IO API in terms of the new API and appropriate conversions.
The old API requires an implicit encoding. The OS accepts or provides bytes, the old API functions accept or return Chars, and the old API functions don't accept an encoding argument. This is why we are (or, at least, I am) suggesting a settable current encoding: the existing API *needs* a current encoding, and I'm assuming that there may be some reluctance to just discard it completely.
While we're at it, do away with the annoying CR/LF problem on Windows; this should simply be part of the local encoding. This way files can always be opened as binary, and hSetBinary can be dropped. (This won't work on ancient platforms where text files and binary files are genuinely different, but these are probably not interesting anyway.)
Apart from OS-specific issues, it would be useful to treat EOL conventions as part of the encoding. E.g. for network protocols which use CRLF, it would be useful to be able to set CRLF as the EOL convention then use e.g. hPutStrLn to write lines.
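Something along the following lines is what I have in mind. This is only a sketch: Eol, toExternal, fromExternal and hPutStrLnWith are hypothetical names, and hPutStrLnWith assumes that the handle itself performs no EOL translation of its own (i.e. that it is effectively in binary mode):

```haskell
import System.IO (Handle, hPutStr)

-- Hypothetical EOL convention, treated as one component of an encoding.
data Eol = LF | CRLF
  deriving (Eq, Show)

eolString :: Eol -> String
eolString LF   = "\n"
eolString CRLF = "\r\n"

-- Rewrite the internal '\n' to the external convention.
toExternal :: Eol -> String -> String
toExternal eol = concatMap (\c -> if c == '\n' then eolString eol else [c])

-- Rewrite the external convention back to the internal '\n'.
fromExternal :: Eol -> String -> String
fromExternal LF   s                   = s
fromExternal CRLF ('\r' : '\n' : cs)  = '\n' : fromExternal CRLF cs
fromExternal CRLF (c : cs)            = c : fromExternal CRLF cs
fromExternal CRLF []                  = []

-- Like hPutStrLn, but terminating the line as the protocol requires.
hPutStrLnWith :: Eol -> Handle -> String -> IO ()
hPutStrLnWith eol h s = hPutStr h (toExternal eol s ++ eolString eol)
```

A client for a CRLF-based protocol would then write e.g. hPutStrLnWith CRLF sock "HELO example.org" without having to embed "\r\n" in every string.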
The same thoughts apply to filenames. Make them [Word8] and convert explicitly.
Well, it's arguable that they should be [Word8] on Unix and String on Windows. I suppose that you could handle the Windows case by automatically converting to/from UTF-8.
By the way, I think a path should be a list of names (that is of type [[Word8]]) and the library would be concerned with putting in the right path separator. Add functions to read and show pathnames in the local conventions and we'll never need to worry about path separators again.
There would certainly be some advantages to making FilePath an abstract type, but there are quite a few corner cases to deal with.
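For the sake of argument, a naive version of the list-of-names idea might look like this. Path, Convention, showPath and readPath are hypothetical names, and the code deliberately ignores the awkward cases:

```haskell
import Data.List (intercalate)
import Data.Word (Word8)

-- Hypothetical path type: a list of names, each a byte string.
newtype Path = Path [[Word8]]
  deriving (Eq, Show)

data Convention = Unix | Windows

separator :: Convention -> Word8
separator Unix    = 0x2F  -- '/'
separator Windows = 0x5C  -- '\'

-- Render a path in the given convention.
showPath :: Convention -> Path -> [Word8]
showPath conv (Path names) = intercalate [separator conv] names

-- Split a rendered path at every separator.
readPath :: Convention -> [Word8] -> Path
readPath conv = Path . splitOn (separator conv)
  where
    splitOn sep xs = case break (== sep) xs of
      (name, [])       -> [name]
      (name, _ : rest) -> name : splitOn sep rest
```

Even in this small sketch the corner cases show up: an absolute Unix path only round-trips because it is read as a list whose first name is empty, Path [] and Path [[]] both render as the empty string, and nothing here deals with drive letters, UNC names, or the fact that Windows accepts both separators.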
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
Well, then you did something stupid, didn't you? If you don't know the encoding you shouldn't decode anything. That's a strong point against any implicit decoding, I think.
Yes. However, I suspect that we will have to live with some of the
mistakes of the past, i.e. using String in the I/O functions.
--
Glynn Clements