
Simon Marlow wrote:
Which is why I'm suggesting changing Char to be a byte, so that we can have the basic, robust API now and wait for the more advanced API, rather than having to wait for a usable API while people sort out all of the issues.
An easier way is just to declare that the existing API assumes a Latin-1 encoding consistently. Later we might add a way to let the application pick another encoding, or request that the I/O library uses the locale encoding.
But how do you do that without breaking stuff? If the application changes the encoding to UTF-8 (either explicitly, or by using the locale's encoding when it happens to be UTF-8), then code such as:
    [filename] <- getArgs
    openFile filename ReadMode
will fail if filename isn't a valid UTF-8 sequence. Similarly for the other cases where the OS accepts/returns byte strings but the Haskell interface uses String.
And that's the correct behaviour, isn't it?
No. The correct behaviour is to keep such data as byte strings. Otherwise it's going to be hard to write robust programs if the hard-wired ISO-8859-1 encoding is ever changed.
In the current implementation, getArgs gets a list of bytes from argv[], which it converts to a String. The String is passed to openFile, which converts it back to a list of bytes which are then passed to open(). Thus the list of bytes is effectively fed through (encode . decode). For ISO-8859-*, this is the identity function. For UTF-8, it's a subfunction of the identity function, i.e. it either returns its input or it fails.
I don't see what is to be gained by having it fail. It would be preferable to just pass the byte string directly from argv[] to open().
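To make the (encode . decode) point concrete, here is a rough sketch rather than anything taken from the actual I/O library. All of the names (decodeLatin1, decodeUtf8, roundTripUtf8 and so on) are made up for illustration, and the UTF-8 decoder is a hand-written strict one that assumes malformed input should be rejected outright:

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ISO-8859-1: every byte maps to the code point with the same number.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Only meaningful for characters up to U+00FF, which is all that
-- decodeLatin1 can produce.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

-- Strict UTF-8 decoder (hypothetical helper): Nothing on any malformed
-- sequence, including overlong forms and surrogates.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b:bs)
  | b < 0x80               = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise              = Nothing
  where
    -- n continuation bytes, initial bits from the lead byte, and the
    -- smallest code point allowed for a sequence of this length.
    multi :: Int -> Int -> Int -> Maybe String
    multi n hi minCp =
      let (conts, rest) = splitAt n bs
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) hi conts
          ok = length conts == n
               && all (\c -> c .&. 0xC0 == 0x80) conts
               && cp >= minCp && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok then (chr cp :) <$> decodeUtf8 rest else Nothing

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (map fromIntegral . enc . ord)
  where
    enc :: Int -> [Int]
    enc c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12), cont (c `shiftR` 6), cont c]
    cont :: Int -> Int
    cont x = 0x80 .|. (x .&. 0x3F)

-- Total: every byte string survives unchanged.
roundTripLatin1 :: [Word8] -> [Word8]
roundTripLatin1 = encodeLatin1 . decodeLatin1

-- Partial: either returns its input or fails.
roundTripUtf8 :: [Word8] -> Maybe [Word8]
roundTripUtf8 = fmap encodeUtf8 . decodeUtf8

main :: IO ()
main = do
  let latin1Name = [0x66, 0xE9] :: [Word8]       -- "fé" as ISO-8859-1 bytes
      utf8Name   = [0x66, 0xC3, 0xA9] :: [Word8] -- "fé" as UTF-8 bytes
  print (roundTripLatin1 latin1Name == latin1Name)
  print (roundTripUtf8 utf8Name == Just utf8Name)
  print (roundTripUtf8 latin1Name)
```

With the Latin-1 pair, any filename the OS hands over comes back out byte for byte. With the UTF-8 pair, the ISO-8859-1 spelling of the same name is rejected before open() is ever reached, which is the failure case described above.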
I'm less concerned about the handling of streams, as you can reasonably add a way to change the encoding before any data has been read or written. I'm more concerned about FilePaths, argv, the environment etc.
Yes, these are interesting issues. Filenames are stored as character strings on some OSs (e.g. Windows) and as byte strings on others. So the portable Haskell API should probably use String, and do decoding based on the locale (if the programmer asks for it).
Argv and the environment - I don't know. Windows CreateProcess() allows these to be UTF-16 strings, but I don't know what encoding/decoding happens between CreateProcess() and what the target process sees in its argv[] (can't be bothered to dig through MSDN right now). I suspect these should be Strings in Haskell too, with appropriate decoding/encoding happening under the hood.
I suspect that Windows will convert them according to the active
codepage, so that OpenFileA(argv[i], ...) works.
--
Glynn Clements