
Ketil Malde wrote:
> I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8...
I don't think the genome is typical text. And I doubt that claim is true if the text is in a CJK language.
> I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8.
Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming that CJK is now a majority of the text data going over the wire on the web; I haven't seen anything scientific backing up those claims, but they certainly seem reasonable.

I believe Google's measurements, based on their own web index, showing wide adoption of UTF-8 are very badly skewed by a strong Western bias.

In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. UTF-16 has also become by far the dominant internal text format for most software and most user platforms, except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future.
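To put some numbers behind that, here is a minimal sketch (using the text and bytestring packages; the sample strings are my own and purely illustrative) that measures how many bytes the same Text occupies once encoded as UTF-8 versus UTF-16:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.ByteString as B

-- Byte counts of a Text value under UTF-8 and UTF-16 encodings.
sizes :: T.Text -> (Int, Int)
sizes t = (B.length (TE.encodeUtf8 t), B.length (TE.encodeUtf16LE t))

main :: IO ()
main = do
  let ascii = T.replicate 1000 "ACGT"         -- genome-like ASCII data
      cjk   = T.replicate 1000 "漢字かな中文"  -- BMP CJK sample text
  print (sizes ascii)  -- (4000, 8000): UTF-8 is half the size
  print (sizes cjk)    -- (18000, 12000): UTF-16 is a third smaller

For pure-ASCII data like the genome example, UTF-8 needs half the bytes of UTF-16; for CJK text in the Basic Multilingual Plane the ratio flips, roughly 3 bytes per character versus 2.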
> Alternatively, we can have different libraries with different representations for different purposes, where you'll get another few percent of juice by switching to the most appropriate.
Currently the latter approach looks to be in favor.

> So if we can't have one single library, let us at least aim for a set of libraries with consistent interfaces and optimal performance. Data.Text is great for UTF-16, and I'd like to have something similar for UTF-8. That is all I'm trying to say.
I agree.

Thanks,
Yitz
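P.S. To illustrate what "consistent interfaces" could mean in practice, here is a purely hypothetical sketch (the TextLike class and its method names are invented for this example, not an existing API): a small class that today's UTF-16-backed Data.Text and a future UTF-8-backed type could both implement, so client code stays representation-agnostic.

import qualified Data.Text as T

-- Hypothetical common interface; any concrete text type provides these.
class TextLike t where
  packT   :: String -> t
  unpackT :: t -> String
  appendT :: t -> t -> t

instance TextLike T.Text where
  packT   = T.pack
  unpackT = T.unpack
  appendT = T.append

-- Client code written against the class does not care which encoding
-- backs the type.
greet :: TextLike t => t -> t
greet name = appendT (packT "Hello, ") name

main :: IO ()
main = putStrLn (unpackT (greet (packT "world" :: T.Text)))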