
Alright, here are the results for the first three in the list (please forgive
me for being lazy; I am a Haskell programmer after all):
ifeng.com:
UTF8: 299949
UTF16: 566610
dzh.mop.com:
GBK: 1866
UTF8: 1891
UTF16: 3684
www.csdn.net:
UTF8: 122870
UTF16: 217420
Seems like UTF8 is a consistent winner versus UTF16, and not much of a loser
to the native formats.
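For anyone who wants to reproduce numbers like these without a full text library, here is a minimal sketch that estimates encoded sizes one code point at a time. It is not the method used to produce the figures above. The names utf8Bytes, utf16Bytes, utf8Size and utf16Size are made up for illustration, and the sketch assumes the input String holds valid Unicode scalar values (no lone surrogates) and that no byte order mark is written.

```haskell
import Data.Char (ord)

-- | Bytes needed to encode one code point in UTF-8.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2
  | n < 0x10000 = 3   -- rest of the BMP, including most CJK
  | otherwise   = 4   -- supplementary planes
  where
    n = ord c

-- | Bytes needed to encode one code point in UTF-16 (no BOM).
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2   -- single 16-bit code unit
  | otherwise       = 4   -- surrogate pair

-- | Total encoded size of an already-decoded string, in bytes.
utf8Size, utf16Size :: String -> Int
utf8Size  = sum . map utf8Bytes
utf16Size = sum . map utf16Bytes
```

In GHCi, utf8Size "<p>\20013\25991</p>" (two CJK characters wrapped in a paragraph tag) evaluates to 13 and utf16Size of the same string to 18.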
Michael
On Wed, Aug 18, 2010 at 11:01 AM, anderson leo
wrote: More typical Chinese web sites:
www.ifeng.com (web site like nytimes)
dzh.mop.com (community for fun)
www.csdn.net (web site for IT)
www.sohu.com (web site like yahoo)
www.sina.com (web site like yahoo)
-- Andrew
On Wed, Aug 18, 2010 at 11:40 AM, Michael Snoyman
wrote: Well, I'm not certain if it counts as a typical Chinese website, but here are the stats:
UTF8: 64,198
UTF16: 113,160
And just for fun, after gzipping:
UTF8: 17,708
UTF16: 19,367
On Wed, Aug 18, 2010 at 2:59 AM, anderson leo
wrote: Hi Michael, here is a web site: http://zh.wikipedia.org/zh-cn/. It is the Chinese-language Wikipedia.
-Andrew
On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman
wrote: On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale
wrote: Ketil Malde wrote:
I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8...
I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language.
I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8.
Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming CJK is now a majority of text data going over the wire on the web, though I haven't seen anything scientific backing up those claims. It certainly seems reasonable. I believe Google's measurements based on their own web index showing wide adoption of UTF-8 are very badly skewed due to a strong Western bias.
In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. Also, UTF-16 has become by far the dominant internal text format for most software and for most user platforms. Except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future.
I think you are conflating two points here, and ignoring some important data. Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data, but even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and JavaScript overhead. I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with.
As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage. I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case. We can't consider a CJK encoding for text, so its prevalence is irrelevant to this topic. What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by default UTF-8.
As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think talking about this in a vacuum, without data, is pointless. If anyone can recommend a CJK website (or a few) which would be considered representative, I'll do the test myself.
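The markup argument can be made concrete with a little arithmetic. Assume, as a simplification, that every non-ASCII character in a page is a BMP CJK character (3 bytes in UTF-8, 2 in UTF-16), while ASCII takes 1 byte in UTF-8 and 2 in UTF-16. A document with a ASCII characters and c CJK characters then costs a + 3c bytes in UTF-8 and 2a + 2c bytes in UTF-16, so UTF-8 is smaller exactly when a > c. A rough sketch of that estimate follows; the names charMix, estimate and utf8Wins are hypothetical, and real pages will deviate from the assumption.

```haskell
import Data.Char (isAscii)

-- | (number of ASCII characters, number of non-ASCII characters).
charMix :: String -> (Int, Int)
charMix s = (a, length s - a)
  where
    a = length (filter isAscii s)

-- | Estimated (UTF-8 bytes, UTF-16 bytes), ASSUMING every non-ASCII
-- character is a BMP CJK character: 3 bytes in UTF-8, 2 in UTF-16.
estimate :: String -> (Int, Int)
estimate s = (a + 3 * c, 2 * a + 2 * c)
  where
    (a, c) = charMix s

-- | Under the same assumption, UTF-8 is smaller exactly when ASCII
-- characters outnumber CJK ones: a + 3c < 2a + 2c  <=>  c < a.
utf8Wins :: String -> Bool
utf8Wins s = a > c
  where
    (a, c) = charMix s
```

For example, estimate "<p>\20013\25991</p>" gives (13, 18): seven ASCII markup characters outweigh two CJK characters, so UTF-8 comes out ahead, while the bare text "\20013\25991" gives (6, 4).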
Michael
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe