Re: [Haskell-cafe] File path programme

30 Jan 2005

robert dockins wrote:
...
...
I don't pretend to fully understand the various Unicode standards, but it
seems to me that these problems run deeper than a file path library. The
equation (decode . encode) /= id seems confusing to me. Can you give me an
example of when this happens?
I am pretty sure that ISO 2022 encoded strings can have multiple ways to
express the same Unicode glyphs.  This means that any sensible relation
between ISO 2022 strings and Unicode strings maps more than one ISO 2022
string onto the same Unicode string.  The inverse is therefore not a
function.  To make it a function, one of the possibly several encodings
of the Unicode string has to be chosen.  So you have an ISO 2022
string A which is decoded to a Unicode string U.  We re-encode U to an
ISO 2022 string B.  It may be that A /= B.  That is the problem.
Exactly.

And it isn't a theoretical issue. E.g. in an environment where EUC-JP
is used, filenames may begin with <ESC>$)B (designate JISX0208 to G1),
or they may not (because G1 is assumed to contain JISX0208 initially).

More generally, ISO-2022 strings frequently contain redundant
character-set switching sequences, so conversion to Unicode and back
again typically won't yield the original sequence of bytes.
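
To make the round trip concrete, here is a minimal Haskell sketch. It
is a toy model, not a real ISO-2022 codec: it handles only 7-bit ASCII
payload bytes plus the single designation sequence ESC $ ) B, and it
assumes G1 already holds JISX0208, so that sequence is redundant. The
names designateG1, decode, encode and roundTrip are made up for this
illustration.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ESC $ ) B: designate JISX0208 to G1.
designateG1 :: [Word8]
designateG1 = [0x1B, 0x24, 0x29, 0x42]

-- Toy decoder. G1 is assumed to hold JISX0208 from the start, so the
-- designation carries no information and is simply dropped. Any other
-- ESC byte, and any byte >= 0x80, is rejected by this sketch.
decode :: [Word8] -> Maybe String
decode (0x1B : 0x24 : 0x29 : 0x42 : rest) = decode rest
decode (b : rest)
  | b < 0x80 && b /= 0x1B = (chr (fromIntegral b) :) <$> decode rest
  | otherwise             = Nothing
decode [] = Just []

-- Toy encoder. It has to pick one byte sequence per string; this one
-- never emits a designation sequence.
encode :: String -> Maybe [Word8]
encode = traverse enc
  where
    enc c
      | ord c < 0x80 && c /= '\ESC' = Just (fromIntegral (ord c))
      | otherwise                   = Nothing

-- Bytes -> String -> bytes.
roundTrip :: [Word8] -> Maybe [Word8]
roundTrip bs = decode bs >>= encode
```

If a is designateG1 followed by the bytes of "a.txt", then decode a
and decode of the bare bytes give the same string, but roundTrip a
returns the bare bytes, which are not equal to a. A program that reads
a filename as bytes, converts it to a String and back, and then tries
to open the result would be looking for a different file.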
...
The various UTF encodings do not have this particular problem; if a UTF
string is valid, then it is a unique representation of a Unicode string.
Except that there are some ad-hoc extensions, e.g. the UTF-8 variant
used by both Java and Tcl permits NUL characters to be embedded in
NUL-terminated UTF-8 strings by encoding them as the two-byte sequence
0xC0 0x80 (an overlong form, which is invalid in UTF-8 proper).
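
A rough Haskell sketch of the difference, covering only code points
below U+0800 (larger ones are left out to keep it short). The names
encodeChar and validTwoByte are hypothetical, not taken from any
library; the Bool argument selects the modified variant.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code point below U+0800. With modified = True, NUL is
-- written as the two-byte sequence C0 80 instead of a single 00 byte.
encodeChar :: Bool -> Char -> Maybe [Word8]
encodeChar modified c
  | n == 0 && modified = Just [0xC0, 0x80]
  | n < 0x80           = Just [fromIntegral n]
  | n < 0x800          = Just [ fromIntegral (0xC0 + n `div` 64)
                              , fromIntegral (0x80 + n `mod` 64) ]
  | otherwise          = Nothing   -- outside the scope of this sketch
  where
    n = ord c

-- Strict check for a two-byte sequence. Lead bytes C0 and C1 could
-- only encode values below 0x80, i.e. overlong forms, so UTF-8 proper
-- rejects them.
validTwoByte :: Word8 -> Word8 -> Bool
validTwoByte lead cont =
  lead >= 0xC2 && lead <= 0xDF && cont >= 0x80 && cont <= 0xBF
```

So the modified encoder produces C0 80 for NUL, which validTwoByte
rejects, while an ordinary two-byte character such as U+00E9 (C3 A9)
passes. A decoder that accepts both 00 and C0 80 for NUL has two byte
sequences mapping to the same character, which brings back the same
kind of non-uniqueness as above.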

-- 
Glynn Clements