
Marcin 'Qrczak' Kowalczyk wrote:
The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a Unicode string. However, decoding is still a partial function and can fail.
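To make the "partial function" point concrete, here is a minimal sketch in Python (chosen only for convenience; nothing in this thread uses it). The helper name `decode_utf8` is hypothetical. It returns None for byte sequences that are not valid UTF-8, such as an overlong encoding or a truncated sequence, and round-trips valid input to the same bytes.

```python
from typing import Optional


def decode_utf8(raw: bytes) -> Optional[str]:
    """Hypothetical helper: decode UTF-8, or return None on invalid input."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Decoding is partial: not every byte sequence maps to a string.
        return None


valid = b"\xc3\xa9"      # UTF-8 encoding of U+00E9
overlong = b"\xc0\xaf"   # overlong encoding of "/", not valid UTF-8
truncated = b"\xc3"      # lead byte with no continuation byte

decoded = decode_utf8(valid)
# A valid sequence is the unique encoding of its string, so it round-trips.
round_trips = decoded is not None and decoded.encode("utf-8") == valid

print(decoded is not None, round_trips)
print(decode_utf8(overlong), decode_utf8(truncated))
```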
And while it is partly true, it is qualified by the problems relative to canonicalization: an accented "e" in Unicode can be represented either as a single precomposed character or as two chars (an e and an accent), and the two should (ideally) compare equal.
In what sense "equal"? They are supposed to be equivalent as far as the semantics of the text is concerned, but representations are clearly different and most programs distinguish them. In particular they are different filenames on both Unix and Windows. AFAIK MacOS normalizes filenames, but using a slightly different algorithm than Unicode (perhaps just an older version).
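As a rough illustration of the distinction, the sketch below compares the two representations first as code point sequences and then after canonical normalization. It assumes Python's standard `unicodedata` module; the choice of "e with acute" and the helper name `canonically_equal` are mine, picked only for the example.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# As sequences of code points (and as encoded bytes) they differ.
print(precomposed == decomposed)
print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))

# They only become equal once both sides are normalized.
print(canonically_equal(precomposed, decomposed))
```

A plain `==` on the default string type answers the code point question; the equivalence question needs an explicit normalization step.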
IMHO it makes no sense to pretend that they are exactly the same when strings consist of code points or lower level units (and I don't believe another choice for the default string type would be practical).
Well, at least you and I agree on that.
Once you start down the "semantic equivalence" route, you will quickly
run into issues like "ß" == "ss", and it only gets worse from there
on.
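
A small sketch of how this plays out, again in Python and only as an illustration: Unicode normalization leaves the sharp s alone, so equating it with "ss" needs a further, looser notion of equivalence. Full case folding (`str.casefold`) is used here as one such notion; that choice is an assumption of the example, not something proposed in this thread.

```python
import unicodedata

sharp_s = "\u00df"   # LATIN SMALL LETTER SHARP S

# Plain comparison: different code points, so not equal.
print(sharp_s == "ss")

# Even compatibility normalization (NFKC) does not turn it into "ss".
print(unicodedata.normalize("NFKC", sharp_s) == "ss")

# Full case folding does, which also makes "Strasse"-style spellings
# compare equal to their sharp-s forms.
print(sharp_s.casefold() == "ss")
print("stra\u00dfe".casefold() == "STRASSE".casefold())
```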
--
Glynn Clements