
On Tue, 2006-04-25 at 13:13 +0100, Duncan Coutts wrote:
On Tue, 2006-04-25 at 13:08 +0100, Simon Marlow wrote:
Donald Bruce Stewart wrote:
The code has been partitioned into:

Data.ByteString: a Word8-only layer. All functions are in terms of Word8.

Data.ByteString.Char: provides an ASCII/byte-Char layer over the Word8 layer.
Ok, but where would we put a UTF8 version of the Char layer? I'm thinking that "Latin1" would be more correct than "Char", and leaves room for adding UTF8 and other encodings later.
As others have pointed out, it's not strictly Latin1. Don and I reckon it's probably safe to say that the current Data.ByteString.Char layer is ok for any 8-bit fixed-width encoding with ASCII as a subset, so that means it's probably ok for many of the Latin* encodings.
How would we distinguish a full fixed-width 4-byte Unicode version? A purist might say that this should be Data.ByteString.Char, since a Char really is a 4-byte Unicode value, and then change the current Data.ByteString.Char to be Data.ByteString.Char8 or something like that.
Actually, after further discussion we now think that strictly Data.ByteString.Char will only fully work with Latin1, because only for Latin1 will the Chars we get back be genuine Unicode code points (since the first 256 code points of Unicode are the same as Latin1 - or so I am told). For other Latin encodings, what you get back will only be a Unicode code point for chars < 128.

So for other Latin encodings you'd need different implementations of w2c & c2w that map the 256 chars to/from the correct Unicode code points. That suggests that we might want to call it Data.ByteString.Latin1. At this point we wish we had parameterisable modules, so we could have various other encodings just by parameterising on the w2c/c2w mappings.

Most of the time you could use Data.ByteString.Latin1 for other Latin encodings and get away with it (so long as you don't want to use things like isUpper for chars > 127), which is both a blessing and a curse.

Duncan
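
To make the w2c/c2w point concrete, here is a minimal sketch of what per-encoding mappings might look like. The names w2cLatin1, c2wLatin1, w2cLatin9 and c2wLatin9 are hypothetical and chosen only for illustration; they are not the definitions used inside Data.ByteString. ISO-8859-15 (Latin-9) serves as the example of "another Latin encoding" because it differs from Latin1 at only eight byte values.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Latin1: the byte value is the Unicode code point, for all 256 bytes.
w2cLatin1 :: Word8 -> Char
w2cLatin1 = chr . fromIntegral

-- Truncates silently, so it is only meaningful for Chars below U+0100.
c2wLatin1 :: Char -> Word8
c2wLatin1 = fromIntegral . ord

-- Hypothetical Latin-9 (ISO-8859-15) decoder: identical to Latin1
-- except at the eight byte values listed here.
w2cLatin9 :: Word8 -> Char
w2cLatin9 w = case w of
  0xA4 -> '\x20AC'  -- euro sign
  0xA6 -> '\x0160'  -- S with caron
  0xA8 -> '\x0161'  -- s with caron
  0xB4 -> '\x017D'  -- Z with caron
  0xB8 -> '\x017E'  -- z with caron
  0xBC -> '\x0152'  -- OE ligature
  0xBD -> '\x0153'  -- oe ligature
  0xBE -> '\x0178'  -- Y with diaeresis
  _    -> w2cLatin1 w

-- Hypothetical Latin-9 encoder: Nothing for Chars the encoding lacks.
c2wLatin9 :: Char -> Maybe Word8
c2wLatin9 c = case c of
  '\x20AC' -> Just 0xA4
  '\x0160' -> Just 0xA6
  '\x0161' -> Just 0xA8
  '\x017D' -> Just 0xB4
  '\x017E' -> Just 0xB8
  '\x0152' -> Just 0xBC
  '\x0153' -> Just 0xBD
  '\x0178' -> Just 0xBE
  _ | ord c < 0x100 && c `notElem` replaced -> Just (c2wLatin1 c)
    | otherwise                             -> Nothing
  where
    -- Latin1 characters that Latin-9 dropped to make room for the above.
    replaced = map w2cLatin1 [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]
```

With these definitions, isUpper (w2cLatin9 0xA6) is True while isUpper (w2cLatin1 0xA6) is False, which is the kind of discrepancy you would hit by reading Latin-9 data through the Latin1 layer. For bytes below 128 the two decoders agree, which is why you mostly get away with it.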