
On Tue, Aug 17, 2010 at 3:23 PM, Yitzchak Gale wrote:
> Michael Snoyman wrote:
>> Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data.
> True, I haven't seen any - except for Google, which I don't believe is accurate. I would like to see some good unbiased data.
> Right now we just have our intuitions based on anecdotal evidence and whatever years of experience we have in IT.
> For the anecdotal evidence, I really wish that people from CJK countries were better represented in this discussion. Unfortunately, Haskell is less prevalent in CJK countries, and there is somewhat of a language barrier.
>> I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with.
> I agree, I wish we had better numbers.
>> even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead... As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in a vacuum, without data, is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself.
> Again, I agree that some real data would be great.
> The problem is, I'm not sure if there is anyone in this discussion who is qualified to come up with anything even close to a fair random sampling, or a CJK website that is representative. As far as I can tell, most of us participating in this discussion have absolutely zero perspective of what computing is like in CJK countries.
I won't call this a scientific study by any stretch of the imagination, but I did a quick test on the www.qq.com homepage. The original file encoding was GB2312; here are the file sizes:
GB2312: 193014 bytes
UTF-8:  200044 bytes
UTF-16: 371938 bytes
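For anyone who wants to repeat this on other pages, here is roughly how it can be done with the text package (a sketch, not the exact script I ran: the file path is a placeholder, and I'm assuming the page has already been transcoded to UTF-8, e.g. with iconv):

    -- Compare the UTF-8 and UTF-16 sizes of a single document.
    -- Assumes the input file is already UTF-8 (e.g. converted with
    -- `iconv -f GB2312 -t UTF-8`); the path below is just a placeholder.
    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      bytes <- B.readFile "qq-homepage.utf8.html"
      let t = TE.decodeUtf8 bytes
      putStrLn $ "UTF-8:  " ++ show (B.length (TE.encodeUtf8 t))
      putStrLn $ "UTF-16: " ++ show (B.length (TE.encodeUtf16LE t))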
>> As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage.
> No, there is a third: using an API that results in robust, readable and maintainable code even in the face of changing encoding requirements. Unless you have proof that the difference in performance between that API and an API with a hard-wired encoding is the factor that is causing your particular application to fail to meet its requirements, the hard-wired approach is guilty of aggravated premature optimization.
> So for example, UTF-8 is an important option to have in a web toolkit. But if that's the only option, that web toolkit shouldn't be considered a general-purpose one in my opinion.
I'm not talking about API changes here; the topic at hand is the internal representation of the stream of characters used by the text package. That is currently UTF-16; I would argue for switching to UTF-8.
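To be clear about why this is not an API question: the usual pipeline is bytes in, Text in the middle, bytes out, and nothing in it names the internal encoding. A minimal sketch (toUpper is just a stand-in for whatever processing you actually do):

    -- Decode, process, re-encode. Only the cost of decodeUtf8/encodeUtf8
    -- changes if text switches its internal representation from UTF-16 to
    -- UTF-8; the code itself stays the same.
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    processUtf8 :: B.ByteString -> B.ByteString
    processUtf8 = TE.encodeUtf8 . T.toUpper . TE.decodeUtf8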
>> I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case.
> Well, to start with, all MS Word documents are in UTF-16. There are a few of those around, I think. Most applications - in some sense of "most" - store text in UTF-16.
> Again, without any data, my intuition tells me that most of the text data stored in the world's files is in UTF-16. There is currently not much Haskell code that reads those formats directly, but I think that will be changing as usage of Haskell in the real world picks up.
I was referring to text files, not binary files with text embedded within them. While we might use the text package to deal with the data from a Word doc once it is in memory, we would almost certainly need to use ByteString (or perhaps the binary package) to actually parse the file. But at the end of the day, you're right: there would be an encoding penalty at a certain point, just not on the entire file.
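Something like the following is what I have in mind (a sketch only: the offset and length are invented, since actually locating the text runs in a .doc is far more involved):

    -- Decode just an embedded UTF-16LE text run from a binary document.
    -- The slice location is hypothetical; the point is that the
    -- transcoding cost falls on that slice, not on the whole file.
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    extractChunk :: B.ByteString -> T.Text
    extractChunk whole =
      let chunk = B.take 2048 (B.drop 512 whole)  -- made-up offsets
      in  TE.decodeUtf16LE chunk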
>> We can't consider a CJK encoding for text,
> Not as a default, certainly not as the only option. But nice to have as a choice.
I think you're missing the point at hand: I don't think *anyone* is opposed to offering encoders/decoders for all the multitude of encoding types out there. In fact, I believe the text-icu package already supports every encoding type under discussion. The question is the internal representation for text, for which a language-specific encoding is *not* a choice, since it does not support all Unicode code points.
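For example, reading GB2312 input would look something like this with text-icu (a sketch from memory; treat the module and function names as assumptions and check the text-icu docs). Whatever the converter reads, the result is a plain Text value in the package's one internal representation, which is exactly why that representation has to cover all of Unicode:

    -- Decode a legacy CJK encoding into Text via an ICU converter.
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.ICU.Convert as ICU

    decodeGB2312 :: B.ByteString -> IO T.Text
    decodeGB2312 bytes = do
      conv <- ICU.open "GB2312" Nothing   -- Nothing = default fallback handling
      return (ICU.toUnicode conv bytes)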
Michael