
On Sat, Mar 24, 2012 at 7:16 PM, Johan Tibell wrote:
> On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis wrote:
>> Hmm, std::u16string, std::u32string, and std::wstring are C++ standard types to process Unicode texts.
>
> Note that at least u16string is too small to encode all of Unicode, and wstring might be as well (where wchar_t is only 16 bits wide), since 16 bits is not enough to encode all of Unicode.
I think there is a confusion here. A Unicode character is an abstract entity. For it to exist in some concrete form in a program, you need an encoding. The fact that char16_t is 16 bits wide is irrelevant to whether it can be used in a representation of Unicode text, just as uint8_t (i.e. unsigned char) can be used to encode Unicode strings despite being only 8 bits wide. You do not need to make the character type exactly equal to the type of the individual element in the text representation. Now, if you want a one-to-one correspondence between individual elements of a std::basic_string and Unicode characters, you would of course go for char32_t, which might be wasteful depending on the circumstances.

Text-processing languages like Perl long ago decided to de-emphasize one-character-at-a-time processing; for most common cases it is simply inefficient. I understand that the efficiency argument may not be as strong in the context of Haskell, but I believe particular attention must be paid to the correctness of the semantics.

Note also that an encoding by itself (whether UTF-8, UTF-16, etc.) is insufficient as far as text processing goes; you also need a locale at the minimum. It is the combination of the two that gives meaning to text representation and operations. I have been following the discussion, but I don't see anything said about locales.

-- Gaby
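
[Editorial sketch, not part of the original message.] To make the code-unit point concrete, here is a minimal C++11 sketch; the example character and the reported sizes are illustrative. The same single Unicode character occupies one, two, or four code units depending on the encoding, yet all three strings represent the same one-character text:

    #include <iostream>
    #include <string>

    int main() {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
        std::u32string utf32 = U"\U0001D11E";       // one char32_t code unit
        std::u16string utf16 = u"\U0001D11E";       // two char16_t code units (a surrogate pair)
        std::string    utf8  = "\xF0\x9D\x84\x9E";  // four 8-bit code units (the UTF-8 bytes)

        std::cout << "UTF-32 code units: " << utf32.size() << '\n'   // prints 1
                  << "UTF-16 code units: " << utf16.size() << '\n'   // prints 2
                  << "UTF-8  code units: " << utf8.size()  << '\n';  // prints 4
    }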
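
[Editorial sketch, not part of the original message.] And a sketch of the locale point: the byte encoding alone does not determine how text compares or sorts. This assumes a system where the de_DE.UTF-8 locale is installed; the words and the resulting order are illustrative.

    #include <algorithm>
    #include <iostream>
    #include <locale>
    #include <string>
    #include <vector>

    int main() {
        // "äpfel" written as explicit UTF-8 bytes to avoid source-encoding issues.
        std::vector<std::string> words = {"zebra", "\xC3\xA4pfel", "apfel"};

        // Plain byte-wise comparison: the byte 0xC3 is greater than 'z',
        // so "äpfel" sorts after "zebra".
        std::sort(words.begin(), words.end());
        for (const auto& w : words) std::cout << w << ' ';
        std::cout << '\n';

        // Locale-aware collation: std::locale's operator() delegates to its
        // std::collate facet, so "äpfel" sorts next to "apfel" under a German locale.
        // (The constructor throws std::runtime_error if de_DE.UTF-8 is not installed.)
        std::locale de("de_DE.UTF-8");
        std::sort(words.begin(), words.end(), de);
        for (const auto& w : words) std::cout << w << ' ';
        std::cout << '\n';
    }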