The Nature of Char and String

Char in Haskell represents a Unicode character. I don't know exactly what its size is, but it must be at least 16 bits, and maybe more. String would then share those properties. However, I'm usually accustomed to dealing with data in 8-bit words. So I have some questions:

* If I use hPutStr on a string, is it guaranteed that the number of 8-bit bytes written equals (length stringWritten)?
  + If no, what is the representation written? I'm assuming UTF-8. How could I find out how many bytes were actually written?
  + If yes, what happens to the upper 8 bits? Are they simply stripped off?
* If I run hGetChar, is it possible that it would consume more than one byte of input? How can I determine whether or not this has happened?
* Does Haskell treat the "this is a Unicode file" marker specially in any way?
* Same questions on withCString and related String<->CString conversions.

-- John

John Goerzen wrote:
Char in Haskell represents a Unicode character. I don't know exactly what its size is, but it must be at least 16 bits and maybe more. String would then share those properties.
However, usually I'm accustomed to dealing with data in 8-bit words. So I have some questions:
Char and String handling in Haskell is deeply broken. There's a discussion ongoing on this very list about fixing it (in the context of pathnames). But for now, Haskell's Char behaves like C's char with respect to I/O. This is unlikely ever to change (in the existing I/O interface) because it would break too much code. So the answers to your questions are:
* If I use hPutStr on a string, is it guaranteed that the number of 8-bit bytes written equals (length stringWritten)?
Yes, if the handle is opened in binary mode. No if not.
+ If yes, what happens to the upper 8 bits? Are they simply stripped off?
Yes.
* If I run hGetChar, is it possible that it would consume more than one byte of input?
No in binary mode, yes in text mode.
* Does Haskell treat the "this is a Unicode file" marker specially in any way?
No.
* Same questions on withCString and related String<->CString conversions.
They all behave as if reading/writing a file in binary mode. -- Ben
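Ben's binary-mode guarantee can be checked with a short program. This is a minimal sketch assuming GHC's System.IO; the file name "demo.bin" is just an example:

```haskell
import System.IO
import Data.Char (chr)

-- In binary mode each Char is written as exactly one byte, so the
-- file size equals the length of the string.
main :: IO ()
main = do
  let s = map chr [72, 101, 108, 108, 111, 0xFF]  -- "Hello" plus a 0xFF byte
  h <- openBinaryFile "demo.bin" WriteMode
  hPutStr h s
  hClose h
  size <- withBinaryFile "demo.bin" ReadMode hFileSize
  print (fromIntegral size == length s)           -- prints True
```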

On Sun, Jan 30, 2005 at 07:39:59PM +0000, Ben Rudiak-Gould wrote:
* If I use hPutStr on a string, is it guaranteed that the number of 8-bit bytes written equals (length stringWritten)?
Yes, if the handle is opened in binary mode. No if not.
Thank you for the informative response. If a file is opened in text mode, what encoding does Haskell grok on input? And what encoding does it generate on output? I'm assuming it's UTF-8 on output, but I don't know for sure. Thanks, John

On Sun, Jan 30, 2005 at 07:58:50PM -0600, John Goerzen wrote:
On Sun, Jan 30, 2005 at 07:39:59PM +0000, Ben Rudiak-Gould wrote:
* If I use hPutStr on a string, is it guaranteed that the number of 8-bit bytes written equals (length stringWritten)?
Yes, if the handle is opened in binary mode. No if not.
Thank you for the informative response.
If a file is opened in text mode, what encoding does Haskell grok on input? And what encoding does it generate on output? I'm assuming it's UTF-8 on output, but I don't know for sure.
I don't think so; it still just treats the Char as an eight-bit C character. The only difference is that if it's opened as text on Windows, your newlines get mangled (or treated properly, depending on your opinion). -- David Roundy http://www.darcs.net

On Sun, Jan 30, 2005 at 07:58:50PM -0600, John Goerzen wrote:
On Sun, Jan 30, 2005 at 07:39:59PM +0000, Ben Rudiak-Gould wrote:
* If I use hPutStr on a string, is it guaranteed that the number of 8-bit bytes written equals (length stringWritten)?
Yes, if the handle is opened in binary mode. No if not.
Thank you for the informative response.
If a file is opened in text mode, what encoding does Haskell grok on input? And what encoding does it generate on output? I'm assuming it's UTF-8 on output, but I don't know for sure.
The GHC standard libraries only work with Latin-1; however, there are quite a few libraries out there for reading/writing text in the current locale (or a set format like UTF-8), so GHC's limitations are not too bad in practice if you need Unicode. Some are even drop-in replacements for the standard Prelude IO functions. It should also be said that these are limitations of the current Haskell tools and not the language itself: the language specifies Unicode and leaves the details of IO up to implementations. John -- John Meacham - ⑆repetae.net⑆john⑈
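In later GHC versions, per-handle encodings of the kind John describes did land in the standard library itself (hSetEncoding and the utf8 TextEncoding in System.IO). A sketch of UTF-8 output with them; the file name "utf8-demo.txt" is just an example:

```haskell
import System.IO

-- Write a 5-character string through a UTF-8 encoder and count the
-- resulting bytes: '\x00EF' takes two bytes, so the file holds 6.
main :: IO ()
main = do
  h <- openFile "utf8-demo.txt" WriteMode
  hSetEncoding h utf8
  hPutStr h "na\x00EF\&ve"   -- "naïve", 5 characters
  hClose h
  size <- withBinaryFile "utf8-demo.txt" ReadMode hFileSize
  print size                 -- prints 6
```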

John Goerzen wrote:
* If I use hPutStr on a string, is it guaranteed that the number of 8-bit bytes written equals (length stringWritten)?
Yes, if the handle is opened in binary mode. No if not.
Thank you for the informative response.
If a file is opened in text mode, what encoding does Haskell grok on input? And what encoding does it generate on output? I'm assuming it's UTF-8 on output, but I don't know for sure.
Haskell doesn't specify an encoding. GHC and Hugs both assume
ISO-8859-1.
The only difference between binary and text modes is that text mode
converts between the platform's EOL conventions (i.e. LF on Unix, CRLF
on Windows) and LF.
--
Glynn Clements
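The text-mode translation Glynn describes can be sketched in pure code. `toCRLF` is a hypothetical illustration of the output direction on Windows, not a library function:

```haskell
-- On output in text mode, each '\n' is expanded to the platform's EOL
-- sequence (CRLF on Windows); on input the reverse mapping is applied.
toCRLF :: String -> String
toCRLF = concatMap (\c -> if c == '\n' then "\r\n" else [c])

main :: IO ()
main = print (toCRLF "a\nb" == "a\r\nb")   -- prints True
```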

Ben Rudiak-Gould wrote:
Char in Haskell represents a Unicode character. I don't know exactly what its size is, but it must be at least 16 bits and maybe more. String would then share those properties.
However, usually I'm accustomed to dealing with data in 8-bit words. So I have some questions:
Char and String handling in Haskell is deeply broken.
More accurately, string I/O (meaning all OS interfaces which take or
return strings, not just reading/writing files) in Haskell is deeply
broken.
The Haskell functions accept or return Strings but interface to OS
functions which (at least on Unix) deal with arrays of bytes (char*),
and the encoding issues are essentially ignored. If you pass strings
containing anything other than ISO-8859-1, you lose.
--
Glynn Clements
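The lossy conversion Glynn describes is visible in the marshalling primitive behind GHC's default CString functions, castCharToCChar from Foreign.C.String, which keeps only the low 8 bits of each code point:

```haskell
import Foreign.C.String (castCharToCChar)

-- castCharToCChar keeps only the low 8 bits of a Char's code point,
-- so U+0141 collapses to 0x41 ('A').
main :: IO ()
main = do
  print (castCharToCChar 'A')        -- prints 65
  print (castCharToCChar '\x0141')   -- prints 65 as well
```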

Glynn Clements wrote:
The Haskell functions accept or return Strings but interface to OS functions which (at least on Unix) deal with arrays of bytes (char*), and the encoding issues are essentially ignored. If you pass strings containing anything other than ISO-8859-1, you lose.
I'm not sure it's as bad as all that. You lose the correct Unicode code points (i.e. chars will have the wrong values, and strings may be the wrong length), but I think you will be able to get the same bytes out as you read in. So in that sense, Char-based IO is somewhat encoding-neutral. So one can have Unicode both in IO and internally, it's just that you don't get both at the same time :-) -kzm -- If I haven't seen further, it is by standing in the footprints of giants
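Ketil's round-trip point can be sketched with two hypothetical helpers: treating each byte as a code point and back is lossless for values 0-255, even when those code points are "wrong":

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode bytes as code points 0-255 and encode back; the UTF-8 bytes
-- for 'é' survive unchanged even though they decode as mojibake.
decode :: [Word8] -> String
decode = map (chr . fromIntegral)

encode :: String -> [Word8]
encode = map (fromIntegral . ord)

main :: IO ()
main = print (encode (decode [0xC3, 0xA9]) == [0xC3, 0xA9])  -- prints True
```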

Ketil Malde wrote:
The Haskell functions accept or return Strings but interface to OS functions which (at least on Unix) deal with arrays of bytes (char*), and the encoding issues are essentially ignored. If you pass strings containing anything other than ISO-8859-1, you lose.
I'm not sure it's as bad as all that. You lose the correct Unicode code points (i.e. chars will have the wrong values, and strings may be the wrong length), but I think you will be able to get the same bytes out as you read in. So in that sense, Char-based IO is somewhat encoding neutral.
So one can have Unicode both in IO and internally, it's just that you don't get both at the same time :-)
That's the problem. Perl is similar: it uses the same strings for byte arrays and for Unicode strings whose characters happen to be Latin-1. The interpretation sometimes depends on the function / library used, and sometimes on other libraries loaded.

When I made an interface between Perl and my language Kogut (which uses Unicode internally and converts texts exchanged with the OS, even though conversion may fail, e.g. for files not encoded using the locale encoding - I don't have a better design yet), I had trouble converting Perl strings which have no characters above 0xFF. If I treat them as Unicode, then a filename passed between the two languages is interpreted differently. If I treat them as the locale encoding, then it's inconsistent and passing strings in both directions doesn't round-trip. So I'm currently treating them as Unicode.

Perl's handling of Unicode is inconsistent with itself (e.g. for filenames containing characters above 0xFF); I don't think I made it more broken than it already is... -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/
participants (7):
- Ben Rudiak-Gould
- David Roundy
- Glynn Clements
- John Goerzen
- John Meacham
- Ketil Malde
- Marcin 'Qrczak' Kowalczyk