
On 25.04 17:26, John Meacham wrote:
> On Wed, Apr 26, 2006 at 02:16:38AM +0300, Einar Karttunen wrote:
>> Using the Word8 API is not very pleasant, because character constants and the like are not Word8.
> Yeah, but the version restricted to Latin-1 seems like a rather special case. I can't imagine it will be used internally in general (and I certainly hope it won't) unless people are already doing low-level stuff. In this day and age I expect Unicode to work pretty much everywhere.
Like in protocols where some segments may be compressed binary data, but ASCII character-based matching is used to distinguish header fields, whose text values may actually be UTF-8?
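For instance something along these lines (just a sketch, written against the Char8-style interface proposed below; the module name is part of that proposal, not an existing API):

    import qualified Data.ByteString.Char8 as C
    import Data.ByteString (ByteString)

    -- Split a header line "Name: value" into its name and raw value bytes.
    -- The name is matched as plain ASCII; the value may be UTF-8 (or even
    -- binary) and is never reinterpreted.
    parseHeader :: ByteString -> Maybe (ByteString, ByteString)
    parseHeader line =
      case C.break (== ':') line of
        (name, rest)
          | C.null rest -> Nothing
          | otherwise   -> Just (name, C.dropWhile (== ' ') (C.drop 1 rest))

    -- ASCII comparison of the field name; the value stays untouched.
    isContentType :: (ByteString, ByteString) -> Bool
    isContentType (name, _) = name == C.pack "Content-Type"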
> I am not saying we should kill the Latin-1 version, since there is interest in it, just that it doesn't fill the need for a general fast string replacement.
It mostly fills the "I want to use the Word8 module with a nicer API" niche. But most of the time the data may not be Latin-1. If we implement a Latin-1 module then we should implement it properly, and if we implement Latin-1 there is also a case for implementing Latin-2 through Latin-5. Of course the people really arguing for this module are not interested in a proper Latin-1 implementation, but just want the encoding-agnostic ASCII superset.

I think the wishes on the libraries list have been mainly:

* UTF-8
* a Word8 interface
* an "ASCII superset"

The easiest way seems to be three modules, one for each. Then we get to the naming part. I would like:

* Data.ByteString.Word8
* Data.ByteString.Char8
* Data.ByteString.UTF

Then select your favourite and make Data.ByteString export that one. I think that could be the Word8 one or the UTF one.
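The default could then simply be a re-export, roughly like this (sketch only, assuming the module names proposed above):

    module Data.ByteString ( module Data.ByteString.Word8 ) where

    -- Re-export whichever interface is picked as the default; here the
    -- Word8 one, but swapping in the UTF module would be a one-line change.
    import Data.ByteString.Word8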
> I don't see why. ASCII is a subset of UTF-8; the routines building a PackedString from an ASCII string or a UTF-8 string can be identical. If you know your string is ASCII to begin with you can use an optimized routine, but the end result is the same as if you had used the general UTF-8 version.
Actually toUpper behaves differently on "ASCII plus opaque high bytes" and on ISO-8859-1. The same holds for all the isXXX predicates, though fortunately that is not a problem for things like whitespace.
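A concrete case, using GHC's Data.Char (the results in the comments are what I would expect from the Unicode tables, so treat them as my assumption):

    import Data.Char (toUpper)

    -- Latin-1 0xE4 is 'ä'; a proper Latin-1 toUpper must give 0xC4 ('Ä'),
    -- while an "ASCII plus opaque high bytes" toUpper would leave it alone.
    upperA :: Char
    upperA = toUpper '\xE4'      -- '\xC4'

    -- 0xFF ('ÿ') uppercases to U+0178, which does not fit in one byte, so a
    -- byte-level Latin-1 implementation has to make a choice here.
    upperY :: Char
    upperY = toUpper '\xFF'      -- '\x178'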
> The proper thing for PackedString is to make it behave exactly as the String instances behave, since it is supposed to be a drop-in replacement. That means the natural ordering based on the Char order, and the toLower and toUpper from the libraries.
The toUpper and toLower in the standard are the correct versions, and they use the Unicode tables. The natural ordering by code point without any normalization is not very useful for text handling, but it works for e.g. putting strings in a Map.
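E.g. something like this works fine without any normalization (sketch using Data.Map):

    import qualified Data.Map as Map

    -- Code point ordering is enough to use Strings as Map keys, even though
    -- it is not a linguistically meaningful sort order.
    headerCount :: Map.Map String Int
    headerCount = Map.fromListWith (+)
        [("Content-Type", 1), ("Received", 1), ("Received", 1)]
    -- Map.lookup "Received" headerCount == Just 2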
> Unicode collation, graphemes, normalization, and localized sorting can be provided as separate routines in another project (it would be nice to have them work on both Strings and PackedStrings, so perhaps they could be in a class?).
These are quite essential for really working with Unicode characters. It didn't matter much before, as Haskell didn't provide good ways to handle Unicode characters in IO, but they are very important; without them it becomes hard to do many useful things with the parsed Unicode characters. How are we supposed to process user input without normalization, e.g. if we need to compare Strings for equivalence? But a simple UTF-8 layer, with more features added later, is a good way to start.

- Einar Karttunen
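PS. A small illustration of the equivalence problem (just a sketch; normalizeNFC is a hypothetical routine of the kind discussed above, not an existing library function):

    composed, decomposed :: String
    composed   = "\xE9"          -- 'é' as one precomposed code point (U+00E9)
    decomposed = "e\x0301"       -- 'e' followed by a combining acute accent

    -- Plain (==) sees two different strings even though a user sees the
    -- same text:
    naive :: Bool
    naive = composed == decomposed          -- False

    -- With an NFC normalization pass they would compare equal:
    -- equivalent a b = normalizeNFC a == normalizeNFC b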