On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <gale@sefer.org> wrote:
> Ketil Malde wrote:
> > I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
> > 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
> > RAM, UTF-16 will be slower than UTF-8...
>
> I don't think the genome is typical text. And
> I doubt that is true if that text is in a CJK language.

As far as space usage goes, you are correct that CJK data will take up more memory in UTF-8 than in UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sample of CJK pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in a vacuum, without data, is pointless. If anyone can recommend a CJK website (or a few) that would be considered representative, I'll do the test myself.
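
For what it's worth, the size comparison I have in mind could be sketched roughly like this in Python. The helper name encoded_sizes and the sample strings are made up for illustration; they are not real measurements, and a real test would run over whole pages fetched from actual sites.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, counted without a BOM."""
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical samples, chosen only to illustrate the two cases.
markup = '<p class="body"></p>'   # ASCII-only HTML markup
cjk = "日本語のテキスト"            # CJK text, all in the Basic Multilingual Plane

# A toy "page": the CJK text placed inside the markup.
page = markup[:16] + cjk + markup[16:]

for name, sample in [("cjk", cjk), ("markup", markup), ("page", page)]:
    utf8, utf16 = encoded_sizes(sample)
    print(f"{name}: utf-8 = {utf8} bytes, utf-16 = {utf16} bytes")
```

In this toy case the bare CJK text is smaller in UTF-16, but once the ASCII markup is included the page as a whole is smaller in UTF-8. Whether that holds for real pages is exactly what the test would need to show.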