Re: [Haskell-cafe] Writing binary files?

14 Sep 2004

      Udo Stenzel wrote:
...
...
Note that this needs to include all of the core I/O functions, not
just reading/writing streams. E.g. FilePath is currently an alias for
String, but (on Unix, at least) filenames are strings of bytes, not
characters. Ditto for argv, environment variables, possibly other
cases which I've overlooked.
I don't think so.  They all are sequences of CChars, and C isn't
particularly known for keeping bytes and chars apart.
CChar is a C "char", which is a byte (not necessarily an octet, and
not necessarily a character either).
...
I believe,
Windows NT has (alternate) filename handling functions that use Unicode
strings.
Almost all of the Win32 API functions which handle strings exist in
both char and wide-char versions.
...
This would strengthen the view that a filename is a sequence
of characters.
It would be reasonable to make FilePath equivalent to String on
Windows, but not on Unix.
...
Ditto for argv, env, whatnot; they are typically entered
from the shell and therefore are characters in the local encoding.
Both argv and envp are char**, i.e. lists of byte strings. There is no
guarantee that the values can be successfully decoded according to the
locale's encoding.

The environment is typically set on login, and inherited thereafter. 
It's typically limited to ASCII, but this isn't guaranteed. Similarly,
a program may need to access files which it didn't create, and which
have filenames which aren't valid strings according to its locale.

E.g. a user may choose a locale which uses UTF-8, but the sysadmin has
installed files with ISO-8859-1 filenames. If a Haskell program tries
to coerce everything to String using the user's locale, the program
will be unable to access such files.
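
To make that failure concrete, here is a minimal sketch using only the
base library. The names decodeLatin1, isValidUtf8 and latin1Name are
made up for illustration, and the UTF-8 check is only structural (it
does not reject overlong forms or surrogates).

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- ISO-8859-1 maps every byte to the code point with the same number,
-- so decoding can never fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Structural UTF-8 check: each lead byte must be followed by the right
-- number of continuation bytes (0x80..0xBF). Illustration only.
isValidUtf8 :: [Word8] -> Bool
isValidUtf8 [] = True
isValidUtf8 (b:bs)
  | b < 0x80              = isValidUtf8 bs
  | b >= 0xC2 && b < 0xE0 = conts 1
  | b >= 0xE0 && b < 0xF0 = conts 2
  | b >= 0xF0 && b < 0xF5 = conts 3
  | otherwise             = False
  where
    conts n =
      let (cs, rest) = splitAt n bs
      in length cs == n && all isCont cs && isValidUtf8 rest
    isCont c = c >= 0x80 && c < 0xC0

-- "cafe" with an e-acute, as an ISO-8859-1 filename: 0xE9 is the accent.
latin1Name :: [Word8]
latin1Name = [0x63, 0x61, 0x66, 0xE9]
```

Here isValidUtf8 latin1Name is False, because 0xE9 announces a
three-byte sequence that never arrives, while decodeLatin1 accepts the
same bytes without complaint. A program that insists on UTF-8 has no
String with which to name that file.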
...
...
...
3. The default encoding is settable from Haskell, defaults to
   ISO-8859-1.
Agreed.
Oh no, please don't do that.  A global, settable encoding is, well,
dys-functional.  Hidden state makes programs hard to understand and
Haskell imho shouldn't go that route.
There's already plenty of hidden state in the system libraries upon
which a Haskell program depends.
...
And please don't introduce the notion of a "default" encoding.
It isn't an issue of *introducing* it. Many Haskell98 functions (i.e. 
much of IO, System and Directory) accept or return Strings, yet have
to be implemented on top of an OS which accepts or provides "char*"s. 
There *has* to be an encoding between the two, and currently it's
hardwired to ISO-8859-1.

The alternative to a global encoding is for *all* functions which
interface to the OS to always either accept or return [CChar] or, if
they accept or return Strings, accept an additional argument which
specifies the encoding.
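
A rough sketch of what the "additional argument" variant might look
like. The Encoding record, latin1, ascii and decodeNames are all
hypothetical names, not part of any existing library, and the OS call
itself is left out so that the example stays pure.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical record describing an encoding; either direction may fail.
data Encoding = Encoding
  { encode :: String -> Maybe [Word8]
  , decode :: [Word8] -> Maybe String
  }

-- ISO-8859-1: every byte decodes, but only code points below 256 encode.
latin1 :: Encoding
latin1 = Encoding
  { encode = mapM (\c -> if ord c < 0x100
                           then Just (fromIntegral (ord c))
                           else Nothing)
  , decode = Just . map (chr . fromIntegral)
  }

-- US-ASCII: bytes and code points of 0x80 and above are rejected.
ascii :: Encoding
ascii = Encoding
  { encode = mapM (\c -> if ord c < 0x80
                           then Just (fromIntegral (ord c))
                           else Nothing)
  , decode = mapM (\b -> if b < 0x80
                           then Just (chr (fromIntegral b))
                           else Nothing)
  }

-- Decode raw names with an explicitly supplied encoding. Names which do
-- not decode come back untouched as Left, so the caller can still use
-- the original bytes.
decodeNames :: Encoding -> [[Word8]] -> [Either [Word8] String]
decodeNames enc = map (\bs -> maybe (Left bs) Right (decode enc bs))
```

With this shape, decodeNames ascii [[0x61], [0xE9]] yields
[Right "a", Left [0xE9]], whereas the same call with latin1 decodes
both names. The cost is that every caller has to thread the encoding
through by hand.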

Also, bear in mind that the functions under discussion are all I/O
functions which, by their nature, deal with state (e.g. the state of
the filesystem).
...
I'd like to see the following:
- Duplicate the IO library.  The duplicate should work with [Byte]
  everywhere where the old library uses String.  Byte is some suitable
  unsigned integer, on most (all?) platforms this will be Word8
Technically it should be CChar. However, it's fairly safe to assume
that a byte will always be 8 bits; almost nobody writes code which
works on systems where it isn't.

However: if we go this route, I suspect that we will also need a
convenient method for specifying literal byte strings in Haskell
source code.
...
- Provide an explicit conversion between encodings.  A simple conversion
  of type [Word8] -> String would suit me, iconv would provide all that
  is needed.
For the general case, you need to allow for stateful encodings (e.g. 
ISO-2022). Actually, even UTF-8 needs to deal with state if you need
to decode byte streams which are split into chunks and the breaks can
occur in the middle of a character (e.g. if you're using non-blocking
I/O).
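
The UTF-8 case can be sketched as a decoder which carries the bytes of
an incomplete character from one chunk to the next. Pending,
decodeChunk and the helpers are invented names for this illustration;
the decoder checks sequence structure and the 0x10FFFF limit but does
not reject overlong forms or surrogates.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decoder state: bytes of an incomplete character left over from the
-- previous chunk.
newtype Pending = Pending [Word8]
  deriving (Eq, Show)

initial :: Pending
initial = Pending []

-- Length of the sequence introduced by a lead byte, if it is one.
seqLen :: Word8 -> Maybe Int
seqLen b
  | b < 0x80              = Just 1
  | b >= 0xC2 && b < 0xE0 = Just 2
  | b >= 0xE0 && b < 0xF0 = Just 3
  | b >= 0xF0 && b < 0xF5 = Just 4
  | otherwise             = Nothing

-- Combine a lead byte and its continuation bytes into a character.
assemble :: Int -> Word8 -> [Word8] -> Maybe Char
assemble n b cs
  | cp > 0x10FFFF = Nothing
  | otherwise     = Just (chr cp)
  where
    cp :: Int
    cp = foldl step (fromIntegral b .&. leadMask) cs
    step acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)
    leadMask = case n of
      1 -> 0x7F
      2 -> 0x1F
      3 -> 0x0F
      _ -> 0x07

-- Decode one chunk. Returns the characters completed by this chunk and
-- the new state, or Nothing if the input is malformed.
decodeChunk :: Pending -> [Word8] -> Maybe (String, Pending)
decodeChunk (Pending old) chunk = go (old ++ chunk)
  where
    isCont c = c .&. 0xC0 == 0x80
    go [] = Just ("", Pending [])
    go bs@(b:rest) = seqLen b >>= \n -> finish n (splitAt (n - 1) rest)
      where
        finish n (cs, more)
          | not (all isCont cs) = Nothing
          | length cs < n - 1   = Just ("", Pending bs)  -- wait for more
          | otherwise           = do
              c         <- assemble n b cs
              (str, st) <- go more
              Just (c : str, st)
```

Feeding [0x63, 0xC3] and then [0xA9] gives "c" with Pending [0xC3]
after the first call, and the e-acute with an empty state after the
second. A stateless [Word8] -> String applied to each chunk would fail
on both.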
...
- iconv takes names of encodings as arguments.  Provide some names as
  constants: one name for the internal encoding (probably UCS4), one
  name for the canonical external encoding (probably locale dependent).
- Then redefine the old IO API in terms of the new API and appropriate
  conversions.
The old API requires an implicit encoding. The OS accepts or
provides bytes, the old API functions accept or return Chars, and the
old API functions don't accept an encoding argument.

This is why we are (or, at least, I am) suggesting a settable current
encoding. Because the existing API *needs* a current encoding, and I'm
assuming that there may be some reluctance to just discard it
completely.
...
While we're at it, do away with the annoying CR/LF problem on Windows,
this should simply be part of the local encoding.  This way files can
always be opened as binary, hSetBinary can be dropped.  (This won't work
on ancient platforms where text files and binary files are genuinely
different, but these are probably not interesting anyway.)
Apart from OS-specific issues, it would be useful to treat EOL
conventions as part of the encoding. E.g. for network protocols which
use CRLF, it would be useful to be able to set CRLF as the EOL
convention then use e.g. hPutStrLn to write lines.
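
As a small sketch of the idea, with the convention passed explicitly
rather than stored in the handle: LineEnding, toEOL and hPutStrLnWith
are hypothetical names, and the handle is assumed to be in binary mode
so that no further translation is applied underneath.

```haskell
import System.IO (Handle, hPutStr)

-- Hypothetical type naming the end-of-line convention to write.
data LineEnding = EndLF | EndCRLF

eolString :: LineEnding -> String
eolString EndLF   = "\n"
eolString EndCRLF = "\r\n"

-- Rewrite every '\n' in the text to the chosen line ending.
toEOL :: LineEnding -> String -> String
toEOL eol = concatMap (\c -> if c == '\n' then eolString eol else [c])

-- Variant of hPutStrLn which takes the convention as an argument.
-- Assumes the handle performs no newline translation of its own.
hPutStrLnWith :: LineEnding -> Handle -> String -> IO ()
hPutStrLnWith eol h s = hPutStr h (toEOL eol s ++ eolString eol)
```

For a CRLF-based protocol, hPutStrLnWith EndCRLF h "HELO example"
would send the line followed by "\r\n". Making the convention part of
the handle's encoding would give the same result through the ordinary
hPutStrLn.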
...
The same thoughts apply to filenames.  Make them [Word8] and convert
explicitly.
Well, it's arguable that they should be [Word8] on Unix and String on
Windows. I suppose that you could handle the Windows case by
automatically converting to/from UTF-8.
...
By the way, I think a path should be a list of names (that
is of type [[Word8]]) and the library would be concerned with putting in
the right path separator.  Add functions to read and show pathnames in
the local conventions and we'll never need to worry about path
separators again.
There would certainly be some advantages to making FilePath an
abstract type, but there are quite a few corner cases to deal with.
...
...
There are limits to the extent to which this can be achieved. E.g. 
what happens if you set the encoding to UTF-8, then call
getDirectoryContents for a directory which contains filenames which
aren't valid UTF-8 strings?
Well, then you did something stupid, didn't you?  If you don't know the
encoding you shouldn't decode anything.  That's a strong point against
any implicit decoding, I think.
Yes. However, I suspect that we will have to live with some of the
mistakes of the past, i.e. using String in the I/O functions.

-- 
Glynn Clements