
On Sat, Mar 24, 2012 at 7:16 PM, Johan Tibell wrote:
> On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis wrote:
>> Hmm, std::u16string, std::u32string, and std::wstring are C++ standard types to process Unicode texts.
>
> Note that at least u16string is too small to encode all of Unicode, and wstring might be as well (where wchar_t is only 16 bits wide), since 16 bits is not enough to encode all of Unicode.
I think there is a confusion here. A Unicode character is an abstract entity. For it to exist in some concrete form in a program, you need an encoding. The fact that char16_t is 16 bits wide is irrelevant to whether it can be used in a representation of Unicode text, just as uint8_t (i.e. unsigned char) can be used to encode Unicode strings despite being only 8 bits wide. You do not need to make the character type exactly equal to the type of the individual element in the text representation. Now, if you want a one-to-one correspondence between individual elements of a std::basic_string and Unicode characters, you would of course go for char32_t, which might be wasteful depending on the circumstances.

Text-processing languages like Perl long ago decided to de-emphasize one-character-at-a-time processing; for most common cases it is simply inefficient. I understand that the efficiency argument may not be as strong in the context of Haskell, but I believe particular attention must be paid to the correctness of the semantics.

Note also that an encoding by itself (whether UTF-8, UTF-16, etc.) is insufficient as far as text processing goes; you also need a locale at the minimum. It is the combination of the two that gives meaning to text representation and operations. I have been following the discussion, but I don't see anything said about locales.

-- Gaby
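
[Editorial sketch, not part of the original message.] To make the code-unit point concrete, here is a minimal C++11 sketch; the example character and the reported sizes are illustrative. The same single Unicode character occupies one, two, or four code units depending on the encoding, yet all three strings represent the same one-character text:

    #include <iostream>
    #include <string>

    int main() {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
        std::u32string utf32 = U"\U0001D11E";       // one char32_t code unit
        std::u16string utf16 = u"\U0001D11E";       // two char16_t code units (a surrogate pair)
        std::string    utf8  = "\xF0\x9D\x84\x9E";  // four 8-bit code units (the UTF-8 bytes)

        std::cout << "UTF-32 code units: " << utf32.size() << '\n'   // prints 1
                  << "UTF-16 code units: " << utf16.size() << '\n'   // prints 2
                  << "UTF-8  code units: " << utf8.size()  << '\n';  // prints 4
    }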
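
[Editorial sketch, not part of the original message.] And a sketch of the locale point: the byte encoding alone does not determine how text compares or sorts. This assumes a system where the de_DE.UTF-8 locale is installed; the words and the resulting order are illustrative.

    #include <algorithm>
    #include <iostream>
    #include <locale>
    #include <string>
    #include <vector>

    int main() {
        // "äpfel" written as explicit UTF-8 bytes to avoid source-encoding issues.
        std::vector<std::string> words = {"zebra", "\xC3\xA4pfel", "apfel"};

        // Plain byte-wise comparison: the byte 0xC3 is greater than 'z',
        // so "äpfel" sorts after "zebra".
        std::sort(words.begin(), words.end());
        for (const auto& w : words) std::cout << w << ' ';
        std::cout << '\n';

        // Locale-aware collation: std::locale's operator() delegates to its
        // std::collate facet, so "äpfel" sorts next to "apfel" under a German locale.
        // (The constructor throws std::runtime_error if de_DE.UTF-8 is not installed.)
        std::locale de("de_DE.UTF-8");
        std::sort(words.begin(), words.end(), de);
        for (const auto& w : words) std::cout << w << ' ';
        std::cout << '\n';
    }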