
On Wed, Apr 26, 2006 at 02:16:38AM +0300, Einar Karttunen wrote:
Using the Word8 API is not very pleasant, because all character constants etc are not Word8.
Yeah, but using the version restricted to latin1 seems a rather special case. I can't imagine (and certainly hope not) that it will be used in general internally, unless people are already doing low-level stuff. In this day and age, I expect unicode to work pretty much everywhere.
This is very useful for many purposes and does not mean that there should not be a fancy UTF8 module. Rather than arguing about killing this, wouldn't it be more productive to create the UTF8 module?
I am not saying we should kill the latin1 version, since there is interest in it, just that it doesn't fill the need for a general fast string replacement.
But note: do the people who want latin1 just need ASCII? If we have a UTF8 PackedString, then we can make ASCII-specific access routines that are just as fast as the ones in the Latin1 variety, without giving up the ability to store full unicode values in the string.
Case conversions and ordering need to be different. Thus we need to newtype things to avoid having two conflicting Ord instances. The UTF8 layer should provide:
I don't see why. ASCII is a subset of UTF8, so the routines building a PackedString from an ASCII string or a UTF8 string can be identical. If you know your string is ASCII to begin with, you can use an optimized routine, but the end result is the same as if you had used the general UTF8 version.
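To illustrate the point, here is a minimal sketch, not the actual PackedString code. The names Packed, packUtf8, packAscii and indexAscii are hypothetical, and a plain list of Word8 stands in for the real packed representation. It assumes the caller of the ASCII routines really does know the data is ASCII.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical stand-in for a packed representation: a list of UTF-8 bytes.
type Packed = [Word8]

-- General routine: encode every Char as UTF-8 (1 to 4 bytes).
-- Surrogate code points get no special handling; this is only a sketch.
packUtf8 :: String -> Packed
packUtf8 = concatMap encodeChar

encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits of the code point, for the first byte.
    hi s   = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- ASCII fast path: valid only if every Char is known to be below 0x80.
packAscii :: String -> Packed
packAscii = map (fromIntegral . ord)

-- ASCII-specific access: if the packed data is known to be ASCII,
-- byte i is character i, so no UTF-8 decoding is needed.
indexAscii :: Packed -> Int -> Char
indexAscii p i = toEnum (fromIntegral (p !! i))
```

On ASCII input the two builders agree, so packAscii "hello" == packUtf8 "hello" holds, and the same Packed value can still carry non-ASCII characters when built with packUtf8.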
* Unicode toUpper/toLower
* Unicode collation (UCA) for Ord
* Graphemes (see Perl6 for good ways to do this)
* Normalisation
Well, none of these are UTF8-specific. We should not worry about the encoding and just think about what 'PackedString' should do. The encoding is unimportant to the API and semantics; the fact that you just happen to be able to quickly convert to/from ASCII and UTF8 should be the only visible difference in behavior.
The proper thing for PackedString is to behave exactly as the String instances behave, since it is supposed to be a drop-in replacement. That means the natural ordering based on the Char order, and the toLower and toUpper from the libraries.
Unicode collation, graphemes, normalization, and localized sorting can be provided as separate routines in another project (it would be nice to have them work on both Strings and PackedStrings, so perhaps they could be in a class?). Certainly a newtype LocalizedPackedString = LocalizedPackedString PackedString with different instances would be a useful thing too. But this should be a separate but related project from just getting a fast string replacement (as in, it shouldn't hold up PackedString development).
John
--
John Meacham - ⑆repetae.net⑆john⑈
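
A rough sketch of the newtype idea from the message above, under loud assumptions: PackedString here is only a hypothetical wrapper around String, and collationKey uses toLower as a placeholder for real localized collation (the UCA is far more involved). It only shows how the wrapper lets two orderings coexist without conflicting Ord instances.

```haskell
import Data.Char (toLower)
import Data.Ord (comparing)

-- Hypothetical stand-in for the packed type; a real one would wrap a byte array.
-- The derived Ord follows the Char order, just as String does.
newtype PackedString = PackedString String
  deriving (Eq, Ord, Show)

-- Wrapper with its own ordering, so the two Ord instances never conflict.
newtype LocalizedPackedString = LocalizedPackedString PackedString
  deriving (Show)

unpack :: PackedString -> String
unpack (PackedString s) = s

-- Placeholder collation key: simple case folding via toLower.
-- This is an assumption for illustration, not real Unicode collation.
collationKey :: LocalizedPackedString -> String
collationKey (LocalizedPackedString p) = map toLower (unpack p)

instance Eq LocalizedPackedString where
  a == b = collationKey a == collationKey b

instance Ord LocalizedPackedString where
  compare = comparing collationKey
```

With these definitions, PackedString "a" < PackedString "B" is False under the Char order, while the same strings wrapped in LocalizedPackedString compare the other way around.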