
Ketil Malde wrote:
> I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8...
I don't think the genome is typical text. And I doubt that claim is true if the text is in a CJK language.
> I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8.
Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming that CJK is now a majority of the text data going over the wire on the web; I haven't seen anything scientific backing up those claims, but they certainly seem reasonable.

I believe Google's measurements, based on their own web index, showing wide adoption of UTF-8 are very badly skewed by a strong Western bias.

In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. UTF-16 has also become by far the dominant internal text format for most software and most user platforms, except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future.
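To put some numbers behind that, here is a minimal sketch (using the text and bytestring packages; the sample strings are my own and purely illustrative) that measures how many bytes the same Text occupies once encoded as UTF-8 versus UTF-16:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.ByteString as B

-- Byte counts of a Text value under UTF-8 and UTF-16 encodings.
sizes :: T.Text -> (Int, Int)
sizes t = (B.length (TE.encodeUtf8 t), B.length (TE.encodeUtf16LE t))

main :: IO ()
main = do
  let ascii = T.replicate 1000 "ACGT"         -- genome-like ASCII data
      cjk   = T.replicate 1000 "漢字かな中文"  -- BMP CJK sample text
  print (sizes ascii)  -- (4000, 8000): UTF-8 is half the size
  print (sizes cjk)    -- (18000, 12000): UTF-16 is a third smaller

For pure-ASCII data like the genome example, UTF-8 needs half the bytes of UTF-16; for CJK text in the Basic Multilingual Plane the ratio flips, roughly 3 bytes per character versus 2.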
> Alternatively, we can have different libraries with different representations for different purposes, where you'll get another few percent of juice by switching to the most appropriate.
Currently the latter approach looks to be in favor.

> So if we can't have one single library, let us at least aim for a set of libraries with consistent interfaces and optimal performance. Data.Text is great for UTF-16, and I'd like to have something similar for UTF-8. That is all I'm trying to say.
I agree.

Thanks,
Yitz
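P.S. To illustrate what "consistent interfaces" could mean in practice, here is a purely hypothetical sketch (the TextLike class and its method names are invented for this example, not an existing API): a small class that today's UTF-16-backed Data.Text and a future UTF-8-backed type could both implement, so client code stays representation-agnostic.

import qualified Data.Text as T

-- Hypothetical common interface; any concrete text type provides these.
class TextLike t where
  packT   :: String -> t
  unpackT :: t -> String
  appendT :: t -> t -> t

instance TextLike T.Text where
  packT   = T.pack
  unpackT = T.unpack
  appendT = T.append

-- Client code written against the class does not care which encoding
-- backs the type.
greet :: TextLike t => t -> t
greet name = appendT (packT "Hello, ") name

main :: IO ()
main = putStrLn (unpackT (greet (packT "world" :: T.Text)))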