
Michael Snoyman wrote:
> Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data
True, I haven't seen any - except for Google's figures, which I don't believe are accurate. I would like to see some good unbiased data. Right now we just have our intuitions, based on anecdotal evidence and however many years of experience we each have in IT. As for the anecdotal evidence, I really wish that people from CJK countries were better represented in this discussion. Unfortunately, Haskell is less prevalent in CJK countries, and there is somewhat of a language barrier.
> I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with.
I agree, I wish we had better numbers.
> even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead... As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in a vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself.
Again, I agree that some real data would be great. The problem is, I'm not sure anyone in this discussion is qualified to come up with anything even close to a fair random sampling, or to name a CJK website that is representative. As far as I can tell, most of us participating in this discussion have absolutely zero perspective on what computing is like in CJK countries.
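To make that comparison concrete if someone does find suitable pages, here is a rough sketch of the measurement itself (nothing toolkit-specific is assumed, just the text and bytestring packages; the file name is made up):

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- Byte counts of the same text under the two encodings.
    encodedSizes :: T.Text -> (Int, Int)
    encodedSizes t = (B.length (TE.encodeUtf8 t), B.length (TE.encodeUtf16LE t))

    -- For a page saved as UTF-8, something like:
    --   t <- TE.decodeUtf8 <$> B.readFile "page.html"
    --   print (encodedSizes t)

Of course this only measures the text itself; whether the markup overhead dominates is exactly the open question.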
> As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage.
No, there is a third: using an API that results in robust, readable and maintainable code even in the face of changing encoding requirements. Unless you have proof that the difference in performance between that API and an API with a hard-wired encoding is the factor that is causing your particular application to fail to meet its requirements, the hard-wired approach is guilty of aggravated premature optimization. So for example, UTF-8 is an important option to have in a web toolkit. But if that's the only option, that web toolkit shouldn't be considered a general-purpose one in my opinion.
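To illustrate what I mean by not hard-wiring the encoding, here is a rough sketch (the names are invented; only the text and bytestring packages are assumed): application code stays in Text, and the choice of byte encoding is a parameter at the boundary rather than something baked into every function.

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- The encodings this hypothetical application has to support today.
    data WireEncoding = Utf8 | Utf16LE

    -- Only the boundary layer chooses a byte representation.
    encodeBody :: WireEncoding -> T.Text -> B.ByteString
    encodeBody Utf8    = TE.encodeUtf8
    encodeBody Utf16LE = TE.encodeUtf16LE

    -- Application code works on Text, e.g.:
    --   respond enc page = send (encodeBody enc (renderPage page))

When the encoding requirements change, only WireEncoding and encodeBody need to be touched.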
> I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case.
Well, to start with, all MS Word documents are in UTF-16. There are a few of those around, I think. Most applications - in some sense of "most" - store text in UTF-16. Again, without any data, my intuition tells me that most of the text data stored in the world's files is in UTF-16. There is currently not much Haskell code that reads those formats directly, but I think that will be changing as usage of Haskell in the real world picks up.
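For what it's worth, decoding such data is already straightforward with the text package. A minimal sketch, assuming the UTF-16LE payload has already been pulled out of its container into a plain file (the file name is invented):

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE
    import qualified Data.Text.IO as TIO

    main :: IO ()
    main = do
      bytes <- B.readFile "extracted-utf16le.txt"  -- hypothetical input
      let t = TE.decodeUtf16LE bytes               -- throws on malformed input
      TIO.putStrLn (T.take 200 t)

The harder part is parsing the surrounding file formats, which is where I expect Haskell code to appear as real-world usage picks up.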
> We can't consider a CJK encoding for text,
Not as a default, certainly not as the only option. But nice to have as a choice.
> What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8,
In Western countries.

Regards,
Yitz