
On Tue, 2006-04-25 at 22:34 +1000, Donald Bruce Stewart wrote:
> ross:
> > On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
> > > The name Latin1 is particularly bad since there are many other single byte encodings around.
> >
> > The name is quite appropriate, since that is the particular encoding of Char that is exposed by the interface. What's bad is that there's no choice. Calling it Latin1 is just being honest about that, and leaving room for modules with other encodings or an interface parameterized by encoding.
>
> Ok. Duncan, Ketil, Ross and Simon make good points here. I'll move Data.ByteString.Char -> Data.ByteString.Latin1

Ok, one final point from a discussion between me and Einar Karttunen... (I'm mindful of Simon's comment about sheds... :-) )

There are two different common uses of an 8-bit string library, with different assumptions and guarantees. (As it happens they have the same implementation.)

In one use case, we want to be able to guarantee that we can get Chars out of our string and that they really are Haskell Chars. That is, they are valid Unicode code points which we could pass to functions like isUpper and get valid answers. As an example, consider the Char 'Â' (chr 0xC2, Latin capital A with circumflex). This is not ASCII but it is clearly upper case. If we don't know that we're working with an 8-bit subset of Unicode then we can't use Unicode properties like isUpper etc.

The other common use case is where we have some character string encoding which contains ASCII as a subset. That is, we don't know the encoding exactly (it may be Latin1, LatinN, UTF8, etc.) but we do know that ASCII chars 0-127 are represented by those same numbers in our byte stream. One place where this is useful is in parsing network protocols. Several of these use 8-bit extensions of ASCII, but the protocol only gives semantics to chars in the ASCII subset. For this case it would be very inconvenient to have to use an API based just on Word8, but on the other hand we can't give a proper guarantee on being able to turn bytes into Haskell Chars (only for bytes < 128).

So what do we do about this? Einar was thinking about an API that might look like this:

  Data.ByteString.{Char8, Latin1, Latin2, ..., UTF8, ...}

Char8 should provide:
  * little overhead
  * For ASCII characters the right translation
  * c2w . w2c = id
  * toUpper and toLower on ASCII
  * Ord with raw byte values

Latin1 should guarantee:
  * Correct translation for Latin1, C0 and C1 characters
  * Really just a subset of Unicode for character handling
  * Predicates like isUpper and isLower
  * toUpper and toLower per Unicode definition (there is no common Latin1 definition afaik)
  * Ord per UCA (Unicode collation algorithm)
  * Or use locale for toUpper/toLower and Ord.

So basically the .Char8 module is for the ASCII extension case and the .Latin1 module is for the 8-bit Unicode subset case.

I think in fact that darcs would want the .Char8 version, but I expect that many other users will want a library that can guarantee conversions to ordinary Haskell Chars (which involves an assumption on the character encoding).

Duncan
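
As a rough illustration of the two cases described above, here is a minimal sketch on plain Word8 values. It is not the actual Data.ByteString code: w2c and c2w are re-defined locally so the example needs only base, and toUpperAscii, isUpperLatin1 and toUpperLatin1 are hypothetical names chosen for this example.

```haskell
import Data.Char (chr, ord, isUpper, toUpper)
import Data.Word (Word8)

-- Local stand-ins for the w2c / c2w conversions named above (assumption:
-- defined here on Word8 only, not taken from any library module).
w2c :: Word8 -> Char
w2c = chr . fromIntegral

-- Keeps only the low 8 bits, so it is faithful only for Chars <= '\xFF'.
c2w :: Char -> Word8
c2w = fromIntegral . ord

-- "Char8" view: only the ASCII range has meaning; every other byte is
-- passed through untouched, whatever encoding it may belong to.
toUpperAscii :: Word8 -> Word8
toUpperAscii w
  | w >= 0x61 && w <= 0x7A = w - 0x20   -- 'a'..'z' -> 'A'..'Z'
  | otherwise              = w

-- "Latin1" view: each byte is the Unicode code point U+0000..U+00FF,
-- so the Data.Char predicates give valid answers.
isUpperLatin1 :: Word8 -> Bool
isUpperLatin1 = isUpper . w2c

-- Unicode upper-casing, kept only when the result still fits in one byte
-- (for example, the upper case of U+00FF lies outside Latin1).
toUpperLatin1 :: Word8 -> Word8
toUpperLatin1 w =
  let u = toUpper (w2c w)
  in if ord u <= 0xFF then c2w u else w
```

With these definitions, the byte 0xE2 is left alone by toUpperAscii but mapped to 0xC2 by toUpperLatin1, which is exactly the difference between assuming "some ASCII extension" and assuming "Latin1".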