
On Tue, Aug 17, 2010 at 3:23 PM, Yitzchak Gale wrote:
> Michael Snoyman wrote:
>> Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data.
> True, I haven't seen any - except for Google, which I don't believe is accurate. I would like to see some good unbiased data.
> Right now we just have our intuitions based on anecdotal evidence and whatever years of experience we have in IT.
> For the anecdotal evidence, I really wish that people from CJK countries were better represented in this discussion. Unfortunately, Haskell is less prevalent in CJK countries, and there is somewhat of a language barrier.
>> I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with.
> I agree, I wish we had better numbers.
>> even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead... As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in a vacuum, without data, is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself.
> Again, I agree that some real data would be great.
> The problem is, I'm not sure if there is anyone in this discussion who is qualified to come up with anything even close to a fair random sampling, or a CJK website that is representative. As far as I can tell, most of us participating in this discussion have absolutely zero perspective of what computing is like in CJK countries.
I won't call this a scientific study by any stretch of the imagination, but I did a quick test on the www.qq.com homepage. The original file encoding was GB2312; here are the file sizes:
GB2312: 193014 bytes
UTF-8:  200044 bytes
UTF-16: 371938 bytes
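For anyone who wants to repeat this on other pages, here is roughly how it can be done with the text package (a sketch, not the exact script I ran: the file path is a placeholder, and I'm assuming the page has already been transcoded to UTF-8, e.g. with iconv):

    -- Compare the UTF-8 and UTF-16 sizes of a single document.
    -- Assumes the input file is already UTF-8 (e.g. converted with
    -- `iconv -f GB2312 -t UTF-8`); the path below is just a placeholder.
    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      bytes <- B.readFile "qq-homepage.utf8.html"
      let t = TE.decodeUtf8 bytes
      putStrLn $ "UTF-8:  " ++ show (B.length (TE.encodeUtf8 t))
      putStrLn $ "UTF-16: " ++ show (B.length (TE.encodeUtf16LE t))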
>> As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage.
> No, there is a third: using an API that results in robust, readable and maintainable code even in the face of changing encoding requirements. Unless you have proof that the difference in performance between that API and an API with a hard-wired encoding is the factor that is causing your particular application to fail to meet its requirements, the hard-wired approach is guilty of aggravated premature optimization.
> So for example, UTF-8 is an important option to have in a web toolkit. But if that's the only option, that web toolkit shouldn't be considered a general-purpose one in my opinion.
I'm not talking about API changes here; the topic at hand is the internal representation of the stream of characters used by the text package. That is currently UTF-16; I would argue for switching to UTF-8.
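To be clear about why this is not an API question: the usual pipeline is bytes in, Text in the middle, bytes out, and nothing in it names the internal encoding. A minimal sketch (toUpper is just a stand-in for whatever processing you actually do):

    -- Decode, process, re-encode. Only the cost of decodeUtf8/encodeUtf8
    -- changes if text switches its internal representation from UTF-16 to
    -- UTF-8; the code itself stays the same.
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    processUtf8 :: B.ByteString -> B.ByteString
    processUtf8 = TE.encodeUtf8 . T.toUpper . TE.decodeUtf8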
>> I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case.
> Well, to start with, all MS Word documents are in UTF-16. There are a few of those around, I think. Most applications - in some sense of "most" - store text in UTF-16.
> Again, without any data, my intuition tells me that most of the text data stored in the world's files is in UTF-16. There is currently not much Haskell code that reads those formats directly, but I think that will be changing as usage of Haskell in the real world picks up.
I was referring to text files, not binary files with text embedded within them. While we might use the text package to deal with the data from a Word doc once it is in memory, we would almost certainly need to use ByteString (or perhaps the binary package) to actually parse the file. But at the end of the day, you're right: there would be an encoding penalty at a certain point, just not on the entire file.
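Something like the following is what I have in mind (a sketch only: the offset and length are invented, since actually locating the text runs in a .doc is far more involved):

    -- Decode just an embedded UTF-16LE text run from a binary document.
    -- The slice location is hypothetical; the point is that the
    -- transcoding cost falls on that slice, not on the whole file.
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    extractChunk :: B.ByteString -> T.Text
    extractChunk whole =
      let chunk = B.take 2048 (B.drop 512 whole)  -- made-up offsets
      in  TE.decodeUtf16LE chunk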
>> We can't consider a CJK encoding for text,
> Not as a default, certainly not as the only option. But nice to have as a choice.
I think you're missing the point at hand: I don't think *anyone* is opposed to offering encoders/decoders for all the multitude of encoding types out there. In fact, I believe the text-icu package already supports every encoding type under discussion. The question is the internal representation for text, for which a language-specific encoding is *not* a choice, since it does not support all Unicode code points.
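For example, reading GB2312 input would look something like this with text-icu (a sketch from memory; treat the module and function names as assumptions and check the text-icu docs). Whatever the converter reads, the result is a plain Text value in the package's one internal representation, which is exactly why that representation has to cover all of Unicode:

    -- Decode a legacy CJK encoding into Text via an ICU converter.
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.ICU.Convert as ICU

    decodeGB2312 :: B.ByteString -> IO T.Text
    decodeGB2312 bytes = do
      conv <- ICU.open "GB2312" Nothing   -- Nothing = default fallback handling
      return (ICU.toUnicode conv bytes)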
Michael