Re: [Haskell-cafe] Re: Writing binary files?

Glynn Clements
Gabriel Ebner wrote:
One more reason to fix the I/O functions to handle encodings and have a separate/underlying binary I/O API.
The problem is that we also need to fix them to handle *no encoding*.
What are you proposing here? Making the breakage even worse by specifying a text-based API that uses "no encoding"? Having a separate byte-based API is far better. If you don't know the encoding, all you have is bytes, no text.
Also, binary data and text aren't disjoint. Everything is binary; some of it is *also* text.
No, it isn't. Everything is binary (read: we need a byte-based I/O library); only after decoding does it become text (read: we need explicit support for decoding, and probably a convenience layer that looks like the old I/O library).
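The decoding step argued for here can be sketched in a few lines. This is only an illustration, not any existing library API: the names decodeLatin1/encodeLatin1 are hypothetical, and a real decoding layer would support many encodings and report errors, rather than hardwiring the trivial Latin-1 case shown.

```haskell
import Data.Word (Word8)
import Data.Char (chr, ord)

-- Hypothetical decoder: interprets each byte as a Latin-1 code point.
-- Latin-1 is the one encoding where byte value == Unicode code point.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- The corresponding (partial) encoder; only valid for code points < 256.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

main :: IO ()
main = putStrLn (decodeLatin1 [72, 97, 115, 107, 101, 108, 108])
```

The point is that [Word8] -> String is an explicit, fallible conversion chosen by the programmer, not something the I/O layer should do silently.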
Strings are a list of Unicode characters; [Word8] is a list of bytes.
And what comes out of (and goes into) most core library functions is the latter.
So System.Directory needs to be specified in terms of bytes, too. Looks like a clean solution to me.

Regards, Udo.

Udo Stenzel wrote:
One more reason to fix the I/O functions to handle encodings and have a separate/underlying binary I/O API.
The problem is that we also need to fix them to handle *no encoding*.
What are you proposing here? Making the breakage even worse by specifying a text-based API that uses "no encoding"?
No. I'm suggesting that many of the I/O functions shouldn't be treating their arguments or return values as text.
Having a separate byte-based API is far better. If you don't know the encoding, all you have is bytes, no text.
My point is that many of the existing functions should be changed to use bytes instead of text (not separate byte/char versions). E.g.:

    type FilePath = [Byte]

If you have a reason to treat a FilePath as text, then you convert it. E.g.:

    names <- getDirectoryContents dir
    let namesT = map (toString localeEncoding) names

We don't need a separate getDirectoryContentsAsText, and we certainly don't want that to be the default. For stream I/O, having both text and binary read/write functions makes sense.
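A runnable sketch of this workflow, under stated assumptions: BytePath, toStringLocale, and getDirectoryContentsBytes are hypothetical stand-ins (no such functions exist in the standard libraries), and the decoder is fixed to Latin-1 where a real one would consult the locale.

```haskell
import Data.Word (Word8)
import Data.Char (chr)

-- Hypothetical byte-based path type, as proposed above.
type BytePath = [Word8]

-- Hypothetical stand-in for `toString localeEncoding`; a real version
-- would decode according to the current locale, not assume Latin-1.
toStringLocale :: BytePath -> String
toStringLocale = map (chr . fromIntegral)

-- Simulated directory listing that returns raw filename bytes,
-- as a byte-based getDirectoryContents would.
getDirectoryContentsBytes :: FilePath -> IO [BytePath]
getDirectoryContentsBytes _ = return [[102,111,111], [98,97,114,46,99]]

main :: IO ()
main = do
  names <- getDirectoryContentsBytes "."
  mapM_ (putStrLn . toStringLocale) names
```

Note that the conversion to text happens only at the point where the program actually needs text (here, printing); everything before that stage handles opaque bytes.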
Strings are a list of Unicode characters; [Word8] is a list of bytes.
And what comes out of (and goes into) most core library functions is the latter.
So System.Directory needs to be specified in terms of bytes, too. Looks like a clean solution to me.
Sure. But I'm looking for a solution which doesn't involve re-writing
everything, and which won't result in lots of programs suddenly
becoming unreliable if the hardwired default ISO-8859-1 conversion is
changed.
--
Glynn Clements

On Thu, 16 Sep 2004, Udo Stenzel wrote:
Having a separate byte-based API is far better. If you don't know the encoding, all you have is bytes, no text.
Okay, after reading large chunks of this discussion, I'm going to rock the boat a bit by suggesting that bytes *are* text, and *do* belong in the Char type, and hence that the current binary file API is actually correct, after a fashion. In fact, I think that we can resolve many of the problems of this thread by abandoning the conceptual distinction between characters and bytes.

Suppose I invoke

    gcc -o XXX YYY.c

where XXX and YYY are strings of Japanese characters. It has been pointed out that if GCC treats its filename arguments as opaque byte strings to be passed back to the appropriate file opening functions, then it will work even if the current locale isn't Japanese. But that's only true on Posix-like systems. On NT, filenames are made of Unicode code points, and argv is encoded according to the current locale. If GCC uses argv, it will fail on the example above. I've run into this problem many times on my desktop XP box, which uses a US-English locale but contains some filenames with Japanese characters in them.

But in any case GCC's arguments aren't really opaque: it needs to check each argument to see if it's an option, and it needs to look at the extensions of files like YYY.c to figure out which subprogram to invoke. Nevertheless, the opaque-filename approach does work on Posix, because -- this is the important bit -- the characters GCC cares about (like '-', 'o', '.', and 'c') have the same representation in every encoding.

In other words, the character encoding is neither transparent nor opaque to GCC, but sort of "band-limited": it can understand the values from 0 to 127, but the higher values are mysterious to it. They could be Latin-1 code points; they could be EUC half-characters; they could be Unicode code points. It doesn't know, and it doesn't *need* to know. It will fail if given an encoding which doesn't follow this rule (e.g. EBCDIC).
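The "band-limited" argument handling described above can be sketched concretely. This is an illustration, not GCC's actual logic: ascii, isOption, and isCSource are hypothetical helpers. Only byte values 0-127 are ever interpreted; any higher bytes (whatever encoding they are in) pass through untouched.

```haskell
import Data.Word (Word8)
import Data.List (isSuffixOf)
import Data.Char (ord)

-- Hypothetical helper: the ASCII bytes of a literal, which by the
-- trans-ASCII assumption are the same in every relevant encoding.
ascii :: String -> [Word8]
ascii = map (fromIntegral . ord)

-- Does the argument start with '-' (byte 45)? Bytes above 127 are
-- never inspected, so the encoding of the rest is irrelevant.
isOption :: [Word8] -> Bool
isOption bs = take 1 bs == ascii "-"

-- Does the argument end in ".c"? Again, only ASCII bytes matter.
isCSource :: [Word8] -> Bool
isCSource = (ascii ".c" `isSuffixOf`)

main :: IO ()
main = print (isOption (ascii "-o"), isCSource (ascii "YYY.c"))
```

A filename whose stem is Japanese bytes in EUC, Shift-JIS, or UTF-8 would classify identically, since isCSource only examines the trailing ASCII bytes.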
We can make GCC (were it implemented in Haskell) work with all filenames on both major platforms without platform-specific code by representing command-line arguments and pathnames as Strings = [Char]s, where Char is defined as the byte values 0-255 on Posix, but the UTF-16 values on Win32. Clearly this is very fragile, but the type system provides a solution:

    newtype {- TransASCIIEncoding a => -} Char a = Chr Word32
    type String a = [Char a]

    class TransASCIIEncoding a where
      maxValueUsedByEncoding :: Word32

    instance TransASCIIEncoding Unicode where ...
    instance TransASCIIEncoding UTF16 where ...
    instance TransASCIIEncoding UTF8 where ...
    instance TransASCIIEncoding GenericByte where ...

    'x' :: Char a
    '\u1234' :: Char Unicode
    '\q789' :: Char WeirdCompilerSupportedEncoding

    instance (TransASCIIEncoding a) => Bounded (Char a) where
      minBound = Chr 0
      maxBound = Chr maxValueUsedByEncoding

    class CharTranscoding a b where
      transcode :: CharacterString a

    ord :: Character a -> Maybe Int  -- Nothing if arg isn't ASCII
    ordUnicode :: Character Unicode -> Int

Obvious problems: backward compatibility, and codings like ISO 2022 and Shift-JIS which break the fundamental assumption. I don't think either problem is fatal. A more flexible subtyping mechanism would be nice, so that (e.g.) byte-writing functions could take any Char type with a sufficiently small maxValue.

-- Ben
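A simplified, compilable version of this phantom-type idea follows. The names (PChar, TransASCII, pOrd) are hypothetical and differ from the sketch above to avoid clashing with the Prelude's Char; the class method takes a dummy argument in the older pre-proxy style so the instance can be selected.

```haskell
import Data.Word (Word32)

-- Phantom-typed character: the `enc` parameter records the encoding
-- at the type level but stores nothing at runtime.
newtype PChar enc = Chr Word32 deriving (Eq, Show)

data Unicode
data Latin1

class TransASCII enc where
  -- Largest code value the encoding uses (dummy argument selects enc).
  maxCode :: enc -> Word32

instance TransASCII Unicode where maxCode _ = 0x10FFFF
instance TransASCII Latin1  where maxCode _ = 0xFF

-- ord in the band-limited style: defined only for the ASCII range,
-- which all trans-ASCII encodings share; Nothing otherwise.
pOrd :: PChar enc -> Maybe Int
pOrd (Chr w) | w < 128   = Just (fromIntegral w)
             | otherwise = Nothing

main :: IO ()
main = print ( pOrd (Chr 65  :: PChar Latin1)
             , pOrd (Chr 200 :: PChar Latin1)
             , maxCode (undefined :: Latin1) )
```

The key property is that pOrd is total and encoding-agnostic for ASCII values, while anything above 127 must go through an explicit transcode step before it can be interpreted.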
participants (3)
-
Ben Rudiak-Gould
-
Glynn Clements
-
Udo Stenzel