Re: [Haskell-cafe] Re: String vs ByteString

18 Aug 2010

John Millikin writes:
...
> The reason many Japanese and Chinese users reject UTF-8 isn't due to
> space constraints (UTF-8 and UTF-16 are roughly equal), it's because
> they reject Unicode itself.

Probably because they don't think it's complicated enough¹?
...
> Shift-JIS and the various Chinese encodings both contain Han
> characters which are missing from Unicode, either due to the Han
> unification or simply were not considered important enough to include

Surely there's enough space left?  I seem to remember some Han
characters outside of the BMP, so I would have guessed this is an
argument from back in the UCS-2 days.
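
A quick way to check the "outside of the BMP" recollection, as a rough
sketch in Python (picked only because it is convenient for poking at
encodings; the helper name utf16_units is made up for this example).
U+20000 is the first code point of CJK Unified Ideographs Extension B,
so it cannot fit in a single 16-bit UCS-2 unit:

```python
# Compare a Han character inside the BMP with one outside it.
han_bmp = "\u6f22"        # 漢, U+6F22, inside the BMP
han_ext_b = "\U00020000"  # first code point of CJK Extension B, plane 2


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


for ch in (han_bmp, han_ext_b):
    print(
        f"U+{ord(ch):04X}: "
        f"utf-8 bytes={len(ch.encode('utf-8'))}, "
        f"utf-16 units={utf16_units(ch)}"
    )
```

The BMP character needs one UTF-16 unit (three bytes in UTF-8), while
the Extension B character needs a surrogate pair (two units, and four
bytes in UTF-8).  That says nothing about whether any particular
disputed character has actually been encoded, only that room exists
beyond the 16-bit range.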

(BTW, on a long train ride, I brought the Linear B alphabet, and
practiced writing notes to my kids.  So Linear B isn't entirely useless
:-)
...
From casual browsing of Wikipedia, the current status in CJK-land seems
to be something like this:

  China: GB2312 and its successor GB18030
  Taiwan, Macao, and Hong Kong: Big5
  Japan: Shift-JIS
  Korea: EUC-KR

It is interesting that some of these provide far fewer characters than
Unicode.  Another feature of several of them is that ASCII and, e.g.,
half-width kana take up one byte, and ideograms take up two, which
correlates with the expected width of the glyphs.
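
The byte widths are easy to check, at least as a rough sketch.  The
example below is Python, because its standard library happens to ship
codecs for all four encodings; the codec names are Python's aliases,
and the helper name width and the sample characters are my own choices
for illustration.  Note that in Shift-JIS only half-width katakana is
single-byte; full-width kana takes two bytes like the ideograms.

```python
# Python standard-library codec names for the encodings listed above.
codecs_by_region = {
    "China": "gb2312",
    "Taiwan/HK": "big5",
    "Japan": "shift_jis",
    "Korea": "euc_kr",
}


def width(ch: str, codec: str) -> int:
    """Bytes needed to encode the single character ch in codec."""
    return len(ch.encode(codec))


ideogram = "\u4e2d"      # 中, present in all four character sets
halfwidth_ka = "\uff76"  # half-width katakana KA
fullwidth_ka = "\u30ab"  # full-width katakana KA

for region, codec in codecs_by_region.items():
    print(f"{region:10} {codec:10} "
          f"ascii={width('A', codec)} ideogram={width(ideogram, codec)}")

print("shift_jis half-width kana:", width(halfwidth_ka, "shift_jis"))
print("shift_jis full-width kana:", width(fullwidth_ka, "shift_jis"))
print("utf-8 ideogram:", width(ideogram, "utf-8"))
```

In each of the four legacy encodings the ASCII letter takes one byte
and the ideogram two, whereas the same ideogram takes three bytes in
UTF-8.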

Several of the pages indicate that Unicode, and mainly UTF-8, is
gradually taking over.

-k

¹ Those who remember Emacs in the MULE days will know what I mean.
-- 
If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde