
On 06/01/2011 05:44, Mark Lentczner wrote:
On Jan 4, 2011, at 5:41 PM, Antoine Latter wrote:
Are you thinking that the BOM should be automatically stripped from UTF8 text at some low level, if present?
It should not. Whether or not a U+FEFF can be stripped depends on the context in which it is found. There is no way that lower-level code, even file primitives, can know this context.
I'm thinking that it would be correct behavior to drop the BOM from the start of a UTF-8 stream, even at a pretty low level. The FAQ seems to allow it as a means of identifying the stream as UTF-8 (although it isn't a reliable means of identification).
§3.9 and §3.10 of the Unicode standard go into more depth on the issue and make things clearer. A leading U+FEFF is considered "not part of the text", and dropped, only when the encoding is UTF-16 or UTF-32. In all other cases (including the -BE and -LE variants of UTF-16 and UTF-32) the U+FEFF character is retained.
The FAQ states that a leading byte sequence of EF BB BF in a stream indicates that the stream is UTF-8, though it doesn't go so far as to say that it can be stripped. Since Unicode doesn't want to encourage the use of BOM in UTF-8 (see end of §3.10), I imagine they don't want to promulgate it as a useful encoding indicator.
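For concreteness, a minimal Haskell sketch of the signature check described above, using Data.ByteString (the function name is illustrative, not from any library discussed in this thread):

    import qualified Data.ByteString as B

    -- True if the stream begins with EF BB BF, the UTF-8 encoding of
    -- U+FEFF. Note that detecting the signature says nothing about
    -- whether it is safe to strip it.
    hasUtf8Bom :: B.ByteString -> Bool
    hasUtf8Bom bs = B.pack [0xEF, 0xBB, 0xBF] `B.isPrefixOf` bs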
So it might be reasonable, when opening a file in UTF-16 mode (not UTF-16BE or UTF-16LE), for the system to read the initial bytes, determine the byte order, and remove the BOM if present[1]. But it isn't safe or correct to do this for UTF-8.
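A hedged sketch of that byte-order decision, assuming the stream is available as a ByteString (the names here are hypothetical; per §3.10, a UTF-16 stream with no BOM is taken as big-endian):

    import qualified Data.ByteString as B

    data ByteOrder = BigEndian | LittleEndian

    -- Inspect the first two bytes of a UTF-16 stream: consume a BOM
    -- if one is present, otherwise default to big-endian per §3.10.
    detectUtf16 :: B.ByteString -> (ByteOrder, B.ByteString)
    detectUtf16 bs = case B.unpack (B.take 2 bs) of
      [0xFE, 0xFF] -> (BigEndian,    B.drop 2 bs)  -- BOM present, BE
      [0xFF, 0xFE] -> (LittleEndian, B.drop 2 bs)  -- BOM present, LE
      _            -> (BigEndian,    bs)           -- no BOM: assume BE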
This is exactly what the built-in System.IO.utf16 codec does. There's also a utf8_bom codec which behaves like utf8 except that it strips an optional leading BOM when reading and emits a BOM when writing.

Cheers,
Simon
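For reference, selecting those codecs on a handle with GHC's System.IO looks like this (a minimal sketch; the filename is made up):

    import System.IO

    main :: IO ()
    main = do
      h <- openFile "input.txt" ReadMode
      hSetEncoding h utf8_bom   -- strip an optional leading BOM;
                                -- use utf16 for BOM-driven byte order
      contents <- hGetContents h
      putStr contents
      hClose h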