
On 06/01/2011 05:44, Mark Lentczner wrote:
On Jan 4, 2011, at 5:41 PM, Antoine Latter wrote:
Are you thinking that the BOM should be automatically stripped from UTF8 text at some low level, if present?
It should not. Whether or not a U+FEFF can be stripped depends on the context in which it is found. There is no way that lower-level code, even file primitives, can know this context.
I'm thinking that it would be correct behavior to drop the BOM from the start of a UTF-8 stream, even at a pretty low level. The FAQ seems to allow it as a means of identifying the stream as UTF-8 (although it isn't a reliable means of identification).
§3.9 and §3.10 of the Unicode standard go into more depth on the issue and make things clearer. A leading U+FEFF is considered "not part of the text", and dropped, only when the encoding is UTF-16 or UTF-32. In all other cases (including the -BE and -LE variants of UTF-16 and UTF-32) the U+FEFF character is retained.
The FAQ states that a leading byte sequence of EF BB BF in a stream indicates that the stream is UTF-8, though it doesn't go so far as to say that it can be stripped. Since Unicode doesn't want to encourage the use of BOM in UTF-8 (see end of §3.10), I imagine they don't want to promulgate it as a useful encoding indicator.
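For concreteness, a minimal Haskell sketch of the signature check described above, using Data.ByteString (the function name is illustrative, not from any library discussed in this thread):

    import qualified Data.ByteString as B

    -- True if the stream begins with EF BB BF, the UTF-8 encoding of
    -- U+FEFF. Note that detecting the signature says nothing about
    -- whether it is safe to strip it.
    hasUtf8Bom :: B.ByteString -> Bool
    hasUtf8Bom bs = B.pack [0xEF, 0xBB, 0xBF] `B.isPrefixOf` bs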
So it might be reasonable, when opening a file in UTF-16 mode (not UTF-16BE or UTF-16LE), for the system to read the initial bytes, determine the byte order, and remove the BOM if present[1]. But it isn't safe or correct to do this for UTF-8.
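A hedged sketch of that byte-order decision, assuming the stream is available as a ByteString (the names here are hypothetical; per §3.10, a UTF-16 stream with no BOM is taken as big-endian):

    import qualified Data.ByteString as B

    data ByteOrder = BigEndian | LittleEndian

    -- Inspect the first two bytes of a UTF-16 stream: consume a BOM
    -- if one is present, otherwise default to big-endian per §3.10.
    detectUtf16 :: B.ByteString -> (ByteOrder, B.ByteString)
    detectUtf16 bs = case B.unpack (B.take 2 bs) of
      [0xFE, 0xFF] -> (BigEndian,    B.drop 2 bs)  -- BOM present, BE
      [0xFF, 0xFE] -> (LittleEndian, B.drop 2 bs)  -- BOM present, LE
      _            -> (BigEndian,    bs)           -- no BOM: assume BE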
This is exactly what the built-in System.IO.utf16 codec does. There's also a utf8_bom codec which behaves like utf8 except that it strips an optional leading BOM when reading and emits a BOM when writing.

Cheers,
Simon
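For reference, selecting those codecs on a handle with GHC's System.IO looks like this (a minimal sketch; the filename is made up):

    import System.IO

    main :: IO ()
    main = do
      h <- openFile "input.txt" ReadMode
      hSetEncoding h utf8_bom   -- strip an optional leading BOM;
                                -- use utf16 for BOM-driven byte order
      contents <- hGetContents h
      putStr contents
      hClose h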