
Ketil Malde
The Haskell functions accept or return Strings but interface to OS functions which (at least on Unix) deal with arrays of bytes (char*), and the encoding issues are essentially ignored. If you pass strings containing anything other than ISO-8859-1, you lose.
I'm not sure it's as bad as all that. You lose the correct Unicode code points (i.e. chars will have the wrong values, and strings may be the wrong length), but I think you will be able to get the same bytes out as you read in. So in that sense, Char-based IO is somewhat encoding neutral.
So one can have Unicode both in IO and internally, it's just that you don't get both at the same time :-)
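To make the "encoding neutral" point concrete, here is a minimal sketch, assuming the byte-oriented Handle behaviour under discussion (each input byte becomes the Char with the same code point, and only the low 8 bits of each Char are written back); the file names are made up for the example:

    import Data.Char (ord)

    main :: IO ()
    main = do
      s <- readFile "utf8-input.txt"   -- UTF-8 bytes arrive as Chars 0..255
      print (take 8 (map ord s))       -- code points are "wrong" (a Latin-1 view of the bytes)
      writeFile "copy.txt" s           -- ...but writing them back reproduces the input bytes

Under that behaviour copy.txt ends up byte-for-byte identical to the input, even though the Chars never held the right Unicode code points.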
That's the problem. Perl is similar: it uses the same strings for byte arrays and for Unicode strings whose characters happen to be Latin-1. The interpretation sometimes depends on the function / library used, and sometimes on other libraries loaded.

When I made an interface between Perl and my language Kogut (which uses Unicode internally and converts texts exchanged with the OS, even though conversion may fail, e.g. for files not encoded using the locale encoding - I don't have a better design yet), I had trouble converting Perl strings which have no characters above 0xFF. If I treat them as Unicode, then a filename passed between the two languages is interpreted differently. If I treat them as the locale encoding, then it's inconsistent and passing strings in both directions doesn't round-trip. So I'm currently treating them as Unicode.

Perl's handling of Unicode is inconsistent with itself (e.g. for filenames containing characters above 0xFF), so I don't think I made it more broken than it already is...

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
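A rough sketch of the round-trip failure being described, assuming the bytestring and text packages and using UTF-8 to stand in for the locale encoding; the byte values are invented. Interpreting bytes as raw code points (the Latin-1 view) round-trips by construction, but decoding them with the locale encoding can lose information:

    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as T
    import qualified Data.Text.Encoding.Error as T

    main :: IO ()
    main = do
      -- "fo" followed by a stray 0xFF byte, as might appear in a filename
      let raw = B.pack [0x66, 0x6F, 0xFF]
      -- 0xFF is not valid UTF-8: the lenient decoder replaces it with U+FFFD,
      -- so re-encoding gives different bytes and the round trip is lost.
      let decoded = T.decodeUtf8With T.lenientDecode raw
      print (T.encodeUtf8 decoded == raw)   -- False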