
Graham Klyne
How can it make sense to have a BOM in UTF-8? UTF-8 is a sequence of octets (bytes); what ordering is there here that can sensibly be varied?
The *name* "BOM" doesn't make sense when applied to UTF-8, but some software uses the UTF-8 encoding of U+FEFF as a marker that the file is encoded in UTF-8 rather than some other encoding. And Unicode seems to support this usage, even if it doesn't recommend it.

The only software I know of that does this is Microsoft Notepad, and I suspect other Microsoft tools do too (Notepad assumes UTF-8 when the marker is present and the current Windows codepage when it is absent). The HTML at http://www.microsoft.com/ begins with a BOM, but other pages linked from there do not.

I think XML used to be silent about this, but was later amended to say explicitly that an optional U+FEFF at the beginning is allowed and is not treated as part of the document contents.

OTOH various other software, in particular generic Unix tools, doesn't treat the UTF-8 BOM specially, and de facto implements the "non-standard" UTF-8 without a BOM.

Technically, in UTF-16/32 the BOM is handled in the translation between the encoding form (a sequence of 16- or 32-bit code units) and the encoding scheme (those code units serialized into bytes). I think it's supposed to be the same in UTF-8, i.e. the analogous translation is *almost* trivial - it translates bytes to the same bytes - except that an initial BOM must be stripped on decoding, and one must be added on encoding when the first character of the contents is U+FEFF (and optionally in other cases). I mean that this is supposed to happen while decoding UTF-8, on the level of bytes, not after decoding, on the level of code points.

Anyway, on Unix it just doesn't happen at all, except in software which explicitly handles it. iconv() doesn't handle the UTF-8 BOM.

If it were up to me, I would ban the UTF-8 BOM altogether. But perhaps the Unicode Consortium can at least be persuaded to recognize that some software doesn't accept a BOM in UTF-8, so that such software could be conforming to a variant of UTF-8 without the BOM rather than being non-conforming altogether.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
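
To make the byte-level handling described above concrete, here is a minimal sketch in Python of how I read that rule. The function names `decode_utf8_scheme` and `encode_utf8_scheme` and the `always_bom` parameter are hypothetical, made up for illustration; this is one interpretation, not text taken from the Unicode Standard.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
BOM = codecs.BOM_UTF8


def decode_utf8_scheme(data: bytes) -> str:
    """Strip at most one leading BOM on the byte level, then decode.

    Hypothetical helper: the stripping happens before decoding,
    not afterwards on the decoded code points.
    """
    if data.startswith(BOM):
        data = data[len(BOM):]
    return data.decode("utf-8")


def encode_utf8_scheme(text: str, always_bom: bool = False) -> bytes:
    """Encode, prepending a BOM when the contents start with U+FEFF.

    The BOM is mandatory in that case so that the decoder above removes
    the marker and not the real first character. `always_bom` models the
    "optionally in other cases" part.
    """
    payload = text.encode("utf-8")
    if always_bom or text.startswith("\ufeff"):
        return BOM + payload
    return payload


# A BOM-less "Unix style" decoder keeps the marker as a character:
plain = (BOM + b"abc").decode("utf-8")          # starts with U+FEFF
# The BOM-aware decoder drops it:
stripped = decode_utf8_scheme(BOM + b"abc")     # just "abc"
# Contents that really begin with U+FEFF survive a round trip:
round_trip = decode_utf8_scheme(encode_utf8_scheme("\ufeffx"))
```

Python's built-in "utf-8-sig" codec strips a leading BOM on decoding in a similar way, but on encoding it always writes one, so it is not quite the same as the conditional rule sketched here.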