
On 2005-01-30, Marcin 'Qrczak' Kowalczyk wrote:
Glynn Clements writes:
And it isn't a theoretical issue. E.g. in an environment where EUC-JP is used, filenames may begin with <ESC>$)B (designate JISX0208 to G1), or they may not (because G1 is assumed to contain JISX0208 initially).
I think such encodings are never used as default encodings of a Unix locale.
The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a Unicode string.
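A rough illustration of the uniqueness claim, using Python's built-in strict UTF-8 decoder as a stand-in for any conforming one. The helper name `utf8_decodes` is made up for this sketch. It shows that the shortest form of U+002F is accepted while a two-byte "overlong" form of the same code point is rejected, so a valid byte string has only one reading at the code point level.

```python
# Hypothetical helper: report whether a byte string is valid UTF-8.
def utf8_decodes(data: bytes) -> bool:
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


slash = "/".encode("utf-8")   # shortest form of U+002F: the single byte 0x2F
overlong_slash = b"\xc0\xaf"  # two-byte overlong form of U+002F, not valid UTF-8

print(utf8_decodes(slash))
print(utf8_decodes(overlong_slash))
```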
BOM is a problem. Unfortunately Unicode mandates that FEFF at the start of a UTF-8 text stream is a mark which doesn't belong to the text.
Right
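To make the FEFF point concrete, here is a small sketch using Python's two UTF-8 codecs; it is only an illustration of how one library exposes the choice, not a statement of what the standard requires. The `utf-8` codec leaves a leading U+FEFF in the decoded text, while `utf-8-sig` treats it as a signature and drops it.

```python
# U+FEFF encoded as UTF-8 (EF BB BF), followed by the text "abc".
data = b"\xef\xbb\xbfabc"

plain = data.decode("utf-8")         # leading U+FEFF is kept as a character
stripped = data.decode("utf-8-sig")  # leading U+FEFF is removed as a signature

print(repr(plain))
print(repr(stripped))

# The same choice exists when encoding: only utf-8-sig writes the signature.
with_sig = "abc".encode("utf-8-sig")
without_sig = "abc".encode("utf-8")
```

Whether a given stream should be read with or without the signature is then decided by whoever picks the codec, which is the statefulness being discussed here.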
It provides variants of UTF-16/32 with and without a BOM, but UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful encoding.
I think you mean "UTF-8 only has the variant without a BOM". Otherwise I'd like to see a citation in the standard for this, because that's not the reading I get from http://www.unicode.org/faq/utf_bom.html. Instead, it seems that whether the BOM is included or not is a function of the protocol, and that the UTF-8 streams themselves do not include the BOM.

-- Aaron Denney -><-