
On Wed, Apr 26, 2006 at 04:48:52AM +0300, Einar Karttunen wrote:
I would like:
* Data.ByteString.Word8
* Data.ByteString.Char8
* Data.ByteString.UTF
And select your favorite and make Data.ByteString export that one. I think that could be the Word8 or the UTF one.
ByteString should be the pure Word8 version; the others can be based on it. ByteString is quite a useful data type independent of anything to do with strings.
I'd like to see Data.PackedString be what you are calling Data.ByteString.UTF, and PackedString _specifically_ be a drop-in replacement for String with an abstract internal representation. It should behave the same as String except when it comes to time and space. I want to be able to just change a few types and routines from String to PackedString in a library (or vice versa) and be guaranteed I am not affecting the meaning of a program.
Though, I do much, much prefer the term 'Char8' to 'Latin1'. I think it better represents what it does: just 'Chars truncated to 8 bits', while 'Latin1' might have other unintended connotations. The fact that the standard routines will interpret them as Latin-1 can be inferred from the fact that the standard routines interpret Chars as Unicode code points. In particular, if you do something wacky where you don't store Unicode values in a 'Char', it doesn't magically become 'Latin1' just because you store it in a Latin-1 string; it just becomes whatever you put in, truncated to 8 bits, and hopefully you know what you are doing.
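To make 'Chars truncated to 8 bits' concrete, here is a minimal sketch. It assumes the Char8 interface as it exists in today's bytestring package (Data.ByteString.Char8), which may differ from what was on the table in this thread; the helper names char8Bytes and roundTrip are made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import Data.Word (Word8)

-- Bytes actually stored when a String goes through the Char8 interface.
-- (Assumption: bytestring's C.pack keeps only the low 8 bits of each Char.)
char8Bytes :: String -> [Word8]
char8Bytes = B.unpack . C.pack

-- Round trip String -> Char8 ByteString -> String.
-- Lossless for code points below 256, lossy for anything above.
roundTrip :: String -> String
roundTrip = C.unpack . C.pack
```

In GHCi, comparing roundTrip "abc" with roundTrip "\955" (a lambda, U+03BB) shows the difference: the first comes back unchanged, the second comes back as a different character because only the low byte survived.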
I don't see why. ASCII is a subset of UTF-8, so the routines building a PackedString from an ASCII string or a UTF-8 string can be identical. If you know your string is ASCII to begin with, you can use an optimized routine, but the end result is the same as if you used the general UTF-8 version.
Actually, toUpper works differently on ASCII plus something in the high bytes than on ISO-8859-1. The same goes for all the isXXX predicates; fortunately this is not a problem for things like whitespace.
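A small sketch of the difference being described, assuming that "ASCII plus something in the high bytes" means a case conversion that only knows about 'a'..'z'. The name asciiUpper is hypothetical; toUpper is the standard Data.Char routine.

```haskell
import Data.Char (chr, ord, toUpper)

-- ASCII-only upper-casing: touches 'a'..'z' and leaves every other
-- value alone, including bytes 128..255.
asciiUpper :: Char -> Char
asciiUpper c
  | c >= 'a' && c <= 'z' = chr (ord c - 32)
  | otherwise            = c
```

Applied to "caf\xE9" (the last character is e with an acute accent), map asciiUpper leaves the final character as it was, while map toUpper also upper-cases it, because the standard routine reads the value as a Unicode code point.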
I am not sure what you mean. The data in a PackedString would always be full Unicode values encoded as UTF-8; there would just be efficient ways to pull in data you know is ASCII, since that can use a memcpy rather than recoding it from whatever format it is in. The fact that it happens to contain only values < 128 won't make a difference for subsequent handling of the string (except perhaps that some routines will be faster). When I say ASCII here, I just mean a UTF-8 string where all values happen to be < 128, which is happily binary compatible with ASCII.
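The claim that an ASCII fast path and the general UTF-8 path give the same result can be sketched as follows. This is only an illustration: packUtf8 and packAscii are hypothetical names, not an existing PackedString API, the encoder is a plain list-based one rather than anything optimized, and it does not reject surrogate code points.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code point as UTF-8 (1 to 4 bytes).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont :: Int -> Word8
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- General routine: works for any String.
packUtf8 :: String -> B.ByteString
packUtf8 = B.pack . concatMap encodeChar

-- Fast path: only valid if the caller knows every Char is < 128.
-- No recoding is needed, each Char is copied as a single byte.
packAscii :: String -> B.ByteString
packAscii = B.pack . map (fromIntegral . ord)
```

For input that really is ASCII, packAscii s and packUtf8 s produce the same bytes, so nothing downstream needs to know which one was used.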
The proper thing for PackedString is to make it behave exactly as the String instances behave, since it is supposed to be a drop-in replacement. That means the natural ordering based on the Char order, and the toLower and toUpper from the libraries.
toUpper and toLower are the correct version in the standard and they use the unicode tables. The natural ordering by codepoint without any normalization is not very useful for text handling, but works for e.g. putting strings in a Map.
Yeah, and it is fast. I always thought we should have two Ord classes, one for human-digestible ordering and the other for fast, implementation-dependent ordering for use only in things like Map and Set. But that is a different issue. In any case, the point I was trying to make is that PackedString should behave exactly like String; whether the instances for String are doing the right thing is a different matter.
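The two kinds of ordering can be sketched like this. The function names are made up for illustration, and the "human" ordering here is only a crude case-insensitive stand-in; real collation would need the Unicode collation algorithm and locale data.

```haskell
import Data.Char (toLower)
import Data.List (sort, sortBy)
import Data.Ord (comparing)
import qualified Data.Map as Map

-- Fast ordering: plain code point comparison, which is what the
-- default Ord instance for String already does.
codePointSort :: [String] -> [String]
codePointSort = sort

-- Stand-in for a human-oriented ordering: compare case-insensitively.
humanSort :: [String] -> [String]
humanSort = sortBy (comparing (map toLower))

-- A Map only needs some consistent total order on its keys, so the
-- code point order is enough here.
wordCounts :: [String] -> Map.Map String Int
wordCounts ws = Map.fromListWith (+) [(w, 1) | w <- ws]
```

With the input ["banana", "Zebra", "apple"], codePointSort puts "Zebra" first because 'Z' has a smaller code point than 'a', while humanSort puts it last.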
Unicode collation, graphemes, normalization, and localized sorting can be provided as separate routines as another project. (It would be nice to have them work on both Strings and PackedStrings, so perhaps they could be in a class?)
These are quite essential for really working with unicode characters. It didn't matter much before as Haskell didn't provide good ways to handle unicode chars with IO, but these are very important, otherwise it becomes hard to do many useful things with the parsed unicode characters.
Yeah, they would be useful things to have, but there is no need to tie them specifically to PackedString (though they would most likely operate on PackedStrings). ginsu and jhc both use Unicode extensively without these routines, so saying it is hard to do useful things is somewhat strong. But they would definitely be very useful to have, and necessary for certain applications.
How are we supposed to process user input without normalization e.g. if we need to compare Strings for equivalence?
we implement normalization and provide it as a library :)
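As a rough idea of what such a library routine would be used for, here is a toy sketch on plain Strings. The normalize below is a placeholder, not real Unicode normalization: it only knows how to compose 'e' followed by U+0301 (combining acute accent) into U+00E9. The names normalize and equivalent are assumptions made for this example.

```haskell
-- Toy normalization: compose 'e' + U+0301 into the single code point
-- U+00E9 and leave everything else alone. A real library would
-- implement the Unicode normalization forms from the character database.
normalize :: String -> String
normalize ('e' : '\x301' : rest) = '\xE9' : normalize rest
normalize (c : rest)             = c : normalize rest
normalize []                     = []

-- Compare two strings for equivalence after normalization.
equivalent :: String -> String -> Bool
equivalent a b = normalize a == normalize b
```

The precomposed and decomposed spellings of "café" differ under the default (==) on String but are equal under equivalent, which is the comparison user input would need.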
But a simple UTF8 layer with more features added later is a good way.
I don't think these features should be in PackedString proper unless they are added to String as well (as in, in the default instances). However, a 'UnicodeString' that is a newtype of PackedString would be easy enough, with just different instance declarations. The library routines for performing these transformations can of course be provided in PackedString, if that makes sense and they don't conflict with any String operations of the same name. Being able to do 'normalize a == normalize b' would be useful for PackedStrings independent of UnicodeString.
        John
--
John Meacham - ⑆repetae.net⑆john⑈