
On Sat, Mar 24, 2012 at 8:51 PM, Johan Tibell wrote:
On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis wrote:
I think there is a confusion here. A Unicode character is an abstract entity. For it to exist in some concrete form in a program, you need an encoding. The fact that char16_t is 16 bits wide is irrelevant to whether it can be used in a representation of Unicode text, just as uint8_t (i.e. 'unsigned char') can be used to encode a Unicode string despite being only 8 bits wide. You do not need to make the character type exactly equal to the type of the individual elements in the text representation.
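To make that distinction concrete, here is a minimal Haskell sketch (assuming only the standard text and bytestring packages; the sample string is illustrative): the abstract character type (Char) is separate from the 8-bit code units (Word8) that the UTF-8 representation is built from.

import qualified Data.ByteString    as B
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE
import           Data.Word          (Word8)

main :: IO ()
main = do
  let s    = "héllo ☃"                 -- 7 abstract characters
      utf8 = TE.encodeUtf8 (T.pack s)  -- concrete representation: 8-bit units
  print (length s)                     -- 7
  print (B.length utf8)                -- 10: more code units than characters
  print (B.unpack utf8 :: [Word8])     -- the individual 8-bit code units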
Well, if you have a >21-bit type you can declare its value to be a Unicode code point (code points are numbered).
That is correct. But because not all Unicode code points represent characters, and not all code point sequences represent valid character sequences, even if you have that >21-bit type T, the list type [T] would still not be a good string type.
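For instance (a small sketch in Haskell, where Char already covers every code point from 0 to 0x10FFFF): a lone surrogate code point is a perfectly legal element of a [Char], yet it is not valid Unicode text, which is why the text package substitutes U+FFFD for it when building a Text.

import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

loneSurrogate :: String
loneSurrogate = ['\xD800']  -- legal as a [Char], but not valid Unicode text

main :: IO ()
main = do
  print (fromEnum (maxBound :: Char))           -- 1114111, i.e. 0x10FFFF
  -- text cannot store a lone surrogate; it replaces it with U+FFFD on pack.
  print (TE.encodeUtf8 (T.pack loneSurrogate))  -- "\239\191\189", U+FFFD in UTF-8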
Using a char* that you claim contains UTF-8 encoded data is bad for safety, as there is no guarantee that that's indeed the case.
Indeed, and that is why a Text should be an abstract datatype, hiding the concrete implementation away from the user.
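As a hedged sketch of what that could look like (hypothetical module and function names, using only the bytestring and text packages): the constructor stays unexported, so the only way to obtain a value is through a smart constructor that actually validates the bytes.

module SafeText
  ( SafeText        -- abstract: the constructor is not exported
  , fromUtf8Bytes   -- hypothetical validating smart constructor
  , toUtf8Bytes
  ) where

import qualified Data.ByteString    as B
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

-- The concrete representation (here simply Text) is hidden from users.
newtype SafeText = SafeText T.Text

-- Validate on the way in: reject byte strings that are not well-formed UTF-8.
fromUtf8Bytes :: B.ByteString -> Either String SafeText
fromUtf8Bytes bs =
  case TE.decodeUtf8' bs of
    Left err -> Left (show err)
    Right t  -> Right (SafeText t)

toUtf8Bytes :: SafeText -> B.ByteString
toUtf8Bytes (SafeText t) = TE.encodeUtf8 t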
Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is insufficient as far as text processing goes; you also need a locale at a minimum. It is the combination of the two that gives meaning to text representations and operations.
text does that via ICU. Some operations would be possible without using the locale, if it weren't for those Turkish i's. :/
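For the record, a hedged sketch of that locale dependence (assuming the text-icu package's Data.Text.ICU.toUpper :: LocaleName -> Text -> Text and its Locale constructor; the exact names may vary between versions):

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text     as T
import qualified Data.Text.ICU as ICU
import qualified Data.Text.IO  as TIO

main :: IO ()
main = do
  -- Default (locale-independent) mapping from the text package: 'i' -> 'I'.
  TIO.putStrLn (T.toUpper "i")
  -- Turkish locale via ICU: 'i' -> U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE.
  TIO.putStrLn (ICU.toUpper (ICU.Locale "tr") "i")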
yeah, 7 bits should be enough for every character ;-) -- Gaby