
On Mon, 2009-03-09 at 18:29 -0700, Alexander Dunlap wrote:
Thanks for all of the responses!
So let me see if my summary is accurate here:
- ByteString is for just that: strings of bytes, generally read off of a disk. The Char8 version just interprets the Word8s as Chars but doesn't do anything special with that.
Right. So it's only suitable for binary or ASCII (or mixed) formats.
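A small sketch of why Char8 is only safe for ASCII or binary data, assuming the `bytestring` package that ships with GHC. The sample string is made up for illustration: it contains one ASCII character and one character above U+00FF.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

main :: IO ()
main = do
  -- "a\955" is 'a' followed by U+03BB (Greek small lambda).
  -- Char8.pack keeps only the low 8 bits of each Char, so U+03BB becomes byte 0xBB.
  let bs = BC.pack "a\955"
  print (B.unpack bs)   -- the raw Word8 values: [97,187]
  print (BC.unpack bs)  -- the bytes read back as Chars: "a\187"
```

The lambda does not survive the round trip: nothing is encoded, the code point is simply truncated to one byte.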
- Data.Text/text library is a higher-level library that deals with "text," abstracting over Unicode details and treating each element as a potentially-multibyte "character."
If you're writing about this on the wiki for people, it's best not to confuse the issue by talking about multibyte anything. We do not describe [Char] as a multibyte encoding of Unicode. We say it is a Unicode string. The abstraction is at the level of Unicode code points. The String type *is* a sequence of Unicode code points.
This is exactly the same for Data.Text. It is a sequence of Unicode code points. It is not an encoding. It is not UTF-anything. It does not abstract over Unicode. The Text type is just like the String type but with different strictness and performance characteristics.
Both are just sequences of Unicode code points. There is a reasonably close correspondence between Unicode code points and what people normally think of as characters.
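To make the "sequence of code points" view concrete, here is a minimal sketch assuming the `text` package. The sample string is hypothetical and deliberately includes a code point outside the Basic Multilingual Plane, to show that the API counts code points regardless of how many bytes any encoding would need.

```haskell
import Data.Char (ord)
import qualified Data.Text as T

main :: IO ()
main = do
  -- 'a', U+03BB (Greek small lambda), U+1D11E (musical symbol G clef)
  let s = "a\955\119070" :: String
      t = T.pack s
  print (length s)              -- 3 code points in the String
  print (T.length t)            -- 3 code points in the Text
  print (map ord (T.unpack t))  -- [97,955,119070]
  print (T.unpack t == s)       -- True: same sequence, different representation
```

Both types give the same answers here; only strictness and performance differ.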
- utf8-string is a wrapper over ByteString that interprets the bytes in the bytestring as potentially-multibyte Unicode "characters."
This, on the other hand, is an encoding. A ByteString is a sequence of bytes, and when we interpret it as UTF-8 we are looking at an encoding of a sequence of Unicode code points. Clear as mud? :-)
Duncan
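A short sketch of the encoding step, for illustration only. It uses `Data.Text.Encoding` from the `text` package rather than utf8-string, purely as a stand-in to keep the example free of extra dependencies; the sample string is made up.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let t     = T.pack "a\955"   -- two code points: 'a' and U+03BB
      bytes = TE.encodeUtf8 t  -- the UTF-8 encoding of those code points
  print (T.length t)                -- 2 code points
  print (B.unpack bytes)            -- 3 bytes: [97,206,187]
  print (TE.decodeUtf8 bytes == t)  -- True: decoding recovers the code points
```

The ByteString holds three bytes while the Text holds two code points; only the ByteString side involves UTF-8.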