
On Sat, Mar 24, 2012 at 8:51 PM, Johan Tibell wrote:
On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis wrote:
I think there is a confusion here. A Unicode character is an abstract entity. For it to exist in some concrete form in a program, you need an encoding. The fact that char16_t is 16 bits wide is irrelevant to whether it can be used in a representation of Unicode text, just as uint8_t (i.e. 'unsigned char') can be used to encode a Unicode string despite being only 8 bits wide. You do not need to make the character type exactly equal to the type of the individual elements in the text representation.
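To make that distinction concrete, here is a minimal Haskell sketch (assuming only the standard text and bytestring packages; the sample string is illustrative): the abstract character type (Char) is separate from the 8-bit code units (Word8) that the UTF-8 representation is built from.

import qualified Data.ByteString    as B
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE
import           Data.Word          (Word8)

main :: IO ()
main = do
  let s    = "héllo ☃"                 -- 7 abstract characters
      utf8 = TE.encodeUtf8 (T.pack s)  -- concrete representation: 8-bit units
  print (length s)                     -- 7
  print (B.length utf8)                -- 10: more code units than characters
  print (B.unpack utf8 :: [Word8])     -- the individual 8-bit code units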
Well, if you have a >21-bit type you can declare its value to be a Unicode code point (code points are numbered).
That is correct. But because not all Unicode code points represent characters, and not all code point sequences represent valid character sequences, even if you have that >21-bit type T, the list type [T] would still not be a good string type.
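For instance (a small sketch in Haskell, where Char already covers every code point from 0 to 0x10FFFF): a lone surrogate code point is a perfectly legal element of a [Char], yet it is not valid Unicode text, which is why the text package substitutes U+FFFD for it when building a Text.

import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

loneSurrogate :: String
loneSurrogate = ['\xD800']  -- legal as a [Char], but not valid Unicode text

main :: IO ()
main = do
  print (fromEnum (maxBound :: Char))           -- 1114111, i.e. 0x10FFFF
  -- text cannot store a lone surrogate; it replaces it with U+FFFD on pack.
  print (TE.encodeUtf8 (T.pack loneSurrogate))  -- "\239\191\189", U+FFFD in UTF-8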
Using a char* that you claim contains UTF-8 encoded data is bad for safety, as there is no guarantee that that's indeed the case.
Indeed, and that is why a Text should be an abstract datatype, hiding the concrete implementation away from the user.
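As a hedged sketch of what that could look like (hypothetical module and function names, using only the bytestring and text packages): the constructor stays unexported, so the only way to obtain a value is through a smart constructor that actually validates the bytes.

module SafeText
  ( SafeText        -- abstract: the constructor is not exported
  , fromUtf8Bytes   -- hypothetical validating smart constructor
  , toUtf8Bytes
  ) where

import qualified Data.ByteString    as B
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

-- The concrete representation (here simply Text) is hidden from users.
newtype SafeText = SafeText T.Text

-- Validate on the way in: reject byte strings that are not well-formed UTF-8.
fromUtf8Bytes :: B.ByteString -> Either String SafeText
fromUtf8Bytes bs =
  case TE.decodeUtf8' bs of
    Left err -> Left (show err)
    Right t  -> Right (SafeText t)

toUtf8Bytes :: SafeText -> B.ByteString
toUtf8Bytes (SafeText t) = TE.encodeUtf8 t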
Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is insufficient as far as text processing goes; you also need a locale at a minimum. It is the combination of the two that gives meaning to text representations and operations.
text does that via ICU. Some operations would be possible without using the locale, if it weren't for those Turkish i's. :/
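For the record, a hedged sketch of that locale dependence (assuming the text-icu package's Data.Text.ICU.toUpper :: LocaleName -> Text -> Text and its Locale constructor; the exact names may vary between versions):

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text     as T
import qualified Data.Text.ICU as ICU
import qualified Data.Text.IO  as TIO

main :: IO ()
main = do
  -- Default (locale-independent) mapping from the text package: 'i' -> 'I'.
  TIO.putStrLn (T.toUpper "i")
  -- Turkish locale via ICU: 'i' -> U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE.
  TIO.putStrLn (ICU.toUpper (ICU.Locale "tr") "i")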
yeah, 7 bits should be enough for every character ;-) -- Gaby