Fri, 5 Oct 2001 23:23:50 +1000, Andrew J Bromage wrote:
There is a set of one million (more correctly, 1M) Unicode characters which are only accessible using surrogate pairs (i.e. two UTF-16 codes). There are currently none of these codes assigned,
This information is out of date. AFAIR about 40000 of them are assigned. Most are for Chinese (current, not historic).
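For reference, here is a minimal sketch of how a code point above U+FFFF turns into a surrogate pair, and where the "1M" figure comes from. The names toUtf16 and supplementaryCount are made up for this illustration, and the function does not reject Chars that are themselves in the surrogate range.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | Encode one Char as UTF-16 code units (hypothetical helper).
-- Code points up to U+FFFF fit in one unit; anything above is split
-- into a high surrogate (0xD800..0xDBFF) and a low surrogate
-- (0xDC00..0xDFFF), each carrying 10 bits of the offset from 0x10000.
toUtf16 :: Char -> [Word16]
toUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000

-- | Code points reachable only through surrogate pairs:
-- 1024 high surrogates times 1024 low surrogates, i.e. 2^20.
supplementaryCount :: Int
supplementaryCount = 1024 * 1024
```

For example, toUtf16 '\x20000' (the first code point of CJK Unified Ideographs Extension B) needs two code units, while toUtf16 'A' needs one.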
So rare, in fact, that having strings take up twice the space that they currently do simply isn't worth the cost.
In Haskell, strings already have high overhead. In GHC, a Char# value (inside a Char object) always takes the same size as a pointer (32 or 64 bits), no matter how much of it is used.
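A rough back-of-the-envelope sketch of that overhead follows. The word counts are assumptions of this sketch, not something stated in this thread: a cons cell is taken to be 3 words (header, head pointer, tail pointer) and a boxed Char 2 words (header plus Char#). An implementation may share boxed Chars for common values, so treat the result as an upper bound. All names are hypothetical.

```haskell
-- Assumed heap layout (an assumption of this sketch, not a measured fact):
-- one cons cell per character, plus one boxed Char per character.
consWords, boxedCharWords :: Int
consWords      = 3  -- header + pointer to head + pointer to tail
boxedCharWords = 2  -- header + word-sized Char# payload

-- | Upper-bound estimate of the bytes used by a fully evaluated String
-- of the given length, for a machine word of wordBytes bytes.
worstCaseStringBytes :: Int -> Int -> Int
worstCaseStringBytes wordBytes len =
  len * (consWords + boxedCharWords) * wordBytes

-- | Bytes for the same text as a flat UTF-16 array, assuming every
-- character fits in a single 16-bit code unit.
utf16Bytes :: Int -> Int
utf16Bytes len = 2 * len
```

Under these assumptions a 1000-character String costs up to 20000 bytes with 4-byte words, against 2000 bytes as a flat UTF-16 array, so going from 16 to 32 bits per character is small next to the list overhead.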
It just goes to show that strings are not merely arrays of characters like some languages would have you believe.
In Haskell String = [Char]. It's true that Char values don't necessarily correspond to glyphs, but Strings are composed of Chars.

--
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK
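To illustrate the point about Chars and glyphs, here is a small sketch. The names decomposed, precomposed and countBase are invented for this example; isMark comes from Data.Char. Counting non-mark Chars is only a crude stand-in for counting glyphs, not a proper grapheme segmentation.

```haskell
import Data.Char (isMark)

-- "e" followed by U+0301 COMBINING ACUTE ACCENT: two Chars, one glyph.
decomposed :: String
decomposed = "e\x0301"

-- The same glyph as the single precomposed code point U+00E9.
precomposed :: String
precomposed = "\x00E9"

-- String is a synonym for [Char], so list functions apply directly.
-- Skipping combining marks gives a rough count of base characters.
countBase :: [Char] -> Int
countBase = length . filter (not . isMark)
```

Here length decomposed is 2 and length precomposed is 1, the two strings compare unequal, and countBase gives 1 for both.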
G'day all.

On Fri, Oct 05, 2001 at 06:17:26PM +0000, Marcin 'Qrczak' Kowalczyk wrote:
This information is out of date. AFAIR about 40000 of them are assigned. Most are for Chinese (current, not historic).
I wasn't aware of this. Last time I looked was Unicode 3.0. Thanks for the update.
In Haskell String = [Char].
I'll concede that String and [Char] are identical as far as the programmer is concerned. :-)

There was some research 10+ years ago about alternative representations for lists which were semantically identical but a little more efficient in memory use. Even if you don't go that far (it is fiddly), constant strings, for example, could be represented as UTF-16/UTF-8/whatever along with some machinery to generate the list on demand. Char objects could be implemented as flyweights. Lots of possibilities.

Cheers,
Andrew Bromage
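As a rough sketch of the "generate the list on demand" idea: a constant string could be stored as compact UTF-8 bytes and decoded lazily into an ordinary [Char]. This is only an illustration and not how any particular compiler does it. The names decodeUtf8 and helloBytes are hypothetical, the byte list stands in for a packed constant, and the decoder assumes well-formed input with no validation.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Lazily decode UTF-8 bytes into a String. Each Char is produced
-- before the rest of the input is examined, so the list is built only
-- as far as the consumer demands it.
-- Assumption: the input is well-formed UTF-8 (nothing is validated).
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80  = chr (fromIntegral b) : decodeUtf8 bs  -- 1-byte (ASCII)
  | b < 0xE0  = multi 1 (fromIntegral b .&. 0x1F)     -- 2-byte sequence
  | b < 0xF0  = multi 2 (fromIntegral b .&. 0x0F)     -- 3-byte sequence
  | otherwise = multi 3 (fromIntegral b .&. 0x07)     -- 4-byte sequence
  where
    -- Fold n continuation bytes (6 payload bits each) into the lead bits.
    multi n lead =
      let (cont, rest) = splitAt n bs
          code = foldl (\acc x -> (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F))
                       lead cont
      in chr code : decodeUtf8 rest

-- | A "constant string" kept as 3 bytes instead of 2 list cells:
-- 'h' followed by U+00E9.
helloBytes :: [Word8]
helloBytes = [0x68, 0xC3, 0xA9]
```

Because the decoder is lazy, take 3 (decodeUtf8 (cycle [0xC3, 0xA9])) terminates even though the input is infinite. To the programmer the result is still a plain String.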
participants (2): Andrew J Bromage, Marcin 'Qrczak' Kowalczyk