Re: [Haskell-cafe] Re: String vs ByteString

17 Aug 2010


Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
UTF-16 "segments" in its list of strict text elements :) Then big chunks of
Western text will be encoded efficiently, and same with CJK! Not sure what
to do about strict Data.Text though :)

On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde <ketil@malde.org> wrote:
...
Michael Snoyman <michael@snoyman.com> writes:
...
As far as space usage, you are correct that CJK data will take up more
memory in UTF-8 than UTF-16.

With the danger of sounding ... alphabetist? as well as belaboring a
point I agree is irrelevant (the storage format):

I'd point out that it seems at least as unfair to optimize for CJK at
the cost of Western languages.  UTF-16 uses two bytes for (most) CJK
ideograms, and (all, I think) characters in Western and other phonetic
scripts.  UTF-8 uses one to two bytes for a lot of Western alphabets,
but three for CJK ideograms.

Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
while an ASCII letter is about six bits.  Thus, the information density
of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
15/16 vs 6/16 for UTF-16.  In other words a given document translated
between Chinese and English should occupy roughly the same space in
UTF-8, but be 2.5 times longer in English for UTF-16.
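
The arithmetic above can be reproduced with a small sketch. It takes the message's own round figures (15 bits per ideogram, 6 bits per ASCII letter) and assumes every symbol is equally likely, which real text is not. The names bitsPerSymbol, density and the ratio bindings are made up for this example.

```haskell
-- Bits of information per symbol, assuming a uniform distribution
-- over the alphabet (a simplification).
bitsPerSymbol :: Double -> Double
bitsPerSymbol alphabetSize = logBase 2 alphabetSize

-- Fraction of stored bits that carry information.
density :: Double -> Double -> Double
density infoBits storedBytes = infoBits / (8 * storedBytes)

-- Round figures used in the message.
cjkBits, asciiBits :: Double
cjkBits   = 15
asciiBits = 6

utf8Cjk, utf8Ascii, utf16Cjk, utf16Ascii :: Double
utf8Cjk    = density cjkBits   3   -- 15/24 = 5/8
utf8Ascii  = density asciiBits 1   -- 6/8
utf16Cjk   = density cjkBits   2   -- 15/16
utf16Ascii = density asciiBits 2   -- 6/16

-- Size of the English text relative to the Chinese text carrying the
-- same information: the inverse ratio of the densities.
utf8Ratio, utf16Ratio :: Double
utf8Ratio  = utf8Cjk  / utf8Ascii
utf16Ratio = utf16Cjk / utf16Ascii

main :: IO ()
main = do
  print (bitsPerSymbol 20000)  -- a little under 15 bits
  print utf8Ratio              -- 5/6: roughly the same space
  print utf16Ratio             -- 2.5
```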
-k
--
If I haven't seen further, it is by standing in the footprints of giants

Daniel Peebles