
On Tue, Aug 17, 2010 at 03:21:32PM +0200, Daniel Peebles wrote:
> Sounds to me like we need a lazy Data.Text variation that allows UTF-8
> and UTF-16 "segments" in its list of strict text elements :) Then big
> chunks of western text will be encoded efficiently, and same with CJK!
> Not sure what to do about strict Data.Text though :)
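
A rough sketch of what such a mixed-encoding chunk list might look like
(the names and representation here are hypothetical; nothing like this
exists in the text package):

    import qualified Data.ByteString as B

    -- Each strict chunk is tagged with its own encoding, so mostly-western
    -- text can sit in compact UTF-8 chunks while CJK-heavy runs use UTF-16.
    data Encoding = UTF8 | UTF16
      deriving (Eq, Show)

    data Chunk = Chunk !Encoding !B.ByteString  -- bytes valid in the tagged encoding

    newtype MixedText = MixedText [Chunk]

    -- Concatenation never re-encodes: it only joins the chunk lists.
    append :: MixedText -> MixedText -> MixedText
    append (MixedText xs) (MixedText ys) = MixedText (xs ++ ys)

An encoder would pick whichever encoding comes out smaller for each
chunk, at the cost of a per-chunk dispatch when traversing.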
If space is really a concern, there should be a variant that uses LZO or
some other fast compression algorithm that allows concatenation as the
back end.

<ranty thing to follow>

That said, there is never a reason to use UTF-16. It is a vestigial
remnant from the brief period when it was thought 16 bits would be
enough for the Unicode standard, and any defense of it nowadays is
after-the-fact justification for having accidentally standardized on it
back in the day. When people chose the 16-bit representation, it was
because they wanted a one-to-one mapping between codepoints and units of
computation, which has many advantages. However, that mapping no longer
holds: codepoints outside the BMP take two UTF-16 code units (a
surrogate pair). If the one-to-one mapping is important, then nowadays
you use UCS-4; otherwise, you use UTF-8. If space is very important,
then you work with compressed text. In practice a mix of the two is
fairly ideal.

        John

--
John Meacham - ⑆repetae.net⑆john⑈ - http://notanumber.net/
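
To make the surrogate-pair point concrete, here is a small,
self-contained sketch (plain Haskell, no libraries; the sample
characters are arbitrary) of how many bytes a single character costs in
each encoding:

    import Data.Char (ord)

    -- Code units per character. A UTF-16 index is not a codepoint index
    -- once you leave the BMP, while UCS-4 is one-to-one by construction.
    utf16Units :: Char -> Int
    utf16Units c = if ord c > 0xFFFF then 2 else 1  -- surrogate pair above U+FFFF

    utf8Bytes :: Char -> Int
    utf8Bytes c
      | ord c <= 0x7F   = 1
      | ord c <= 0x7FF  = 2
      | ord c <= 0xFFFF = 3
      | otherwise       = 4

    main :: IO ()
    main = mapM_ report "a\xE9\x6F22\x1D11E"  -- ASCII, Latin-1, CJK, non-BMP
      where
        report c = putStrLn $ show c ++ ": utf8=" ++ show (utf8Bytes c)
                           ++ "B utf16=" ++ show (2 * utf16Units c)
                           ++ "B ucs4=4B"

Running it shows UTF-8 winning for ASCII and Latin-1, UTF-16 winning for
the CJK character, and neither encoding being fixed-width once the
non-BMP character appears.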