
On Sat, Aug 14, 2010 at 22:39, Edward Z. Yang wrote:
> Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
>> Also, despite the name, ByteString and Text are for separate purposes. ByteString is an efficient [Word8], Text is an efficient [Char] -- use ByteString for binary data, and Text for...text. Most mature languages have both types, though the choice of UTF-16 for Text is unusual.
> Given that Python, .NET, Java, and Windows all use UTF-16 for their Unicode text representations, I cannot really agree with "unusual". :-)
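
A minimal sketch of the ByteString/Text split quoted above -- assuming only the standard bytestring and text packages, with illustrative names -- showing that the two types meet only at an explicit encode/decode boundary:

    import qualified Data.ByteString as B      -- efficient [Word8]: raw bytes
    import qualified Data.Text as T            -- efficient [Char]: Unicode text
    import qualified Data.Text.Encoding as TE

    greeting :: T.Text
    greeting = T.pack "h\xe9llo"               -- text is code points, not bytes

    wire :: B.ByteString
    wire = TE.encodeUtf8 greeting              -- binary data via explicit encoding

    roundTrip :: T.Text
    roundTrip = TE.decodeUtf8 wire             -- throws on malformed UTF-8 input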
Python doesn't use UTF-16; on UNIX systems it uses UCS-4, and on Windows it uses UCS-2. The difference is important because it changes how astral characters are counted:

    Python (UCS-2): len(u"\U0001dd1e")         == 2
    Haskell:        length (pack "\x0001dd1e") == 1

Java, .NET, Windows, JavaScript, and some other languages use UTF-16 because, when Unicode support was added to those systems, the astral characters had not been invented yet and 16 bits was enough for the entire Unicode character set. They originally used UCS-2, but then moved to UTF-16 to minimize incompatibilities.

Anything based on UNIX generally uses UTF-8, because Unicode support was added later, after the problems of UCS-2/UTF-16 had been discovered. C libraries written by UNIX users use UTF-8 almost exclusively -- this includes most language bindings available on Hackage.

I don't mean that UTF-16 is itself unusual, but it is a legacy encoding -- there's no reason to use it in new projects. If "text" had been started 15 years ago, I could understand, but since it's still in active development, the use of UTF-16 simply adds baggage.
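
The code-point/code-unit distinction can be checked in Haskell itself -- a minimal sketch, assuming the current UTF-16-based text package, whose Data.Text.Foreign.lengthWord16 reports the underlying code-unit count:

    import qualified Data.Text as T
    import qualified Data.Text.Foreign as TF   -- lengthWord16: UTF-16 code units

    main :: IO ()
    main = do
      let t = T.pack "\x0001dd1e"   -- a single astral code point
      print (T.length t)            -- 1: Text's length counts code points
      print (TF.lengthWord16 t)     -- 2: stored internally as a surrogate pair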