
5 Apr
2011
3:22 a.m.
On Mon, 2011-04-04 at 11:50 +0200, Roel van Dijk wrote:
I am not aware of any algorithm that can reliably infer the character encoding used by just looking at the raw data. Why would people bother with stuff like <?xml version="1.0" encoding="UTF-8"?> if automatically figuring out the encoding was easy?
It is possible, if the syntax/grammar of the encoded content restricts the set of allowed code points in the first few characters. For instance, valid JSON (see RFC 4627, section 3) requires the first two characters to be plain ASCII code points, so inspecting the first 4 bytes of the stream uniquely determines which of the 5 BOM-less UTF encodings (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE) is in use.
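To make the JSON case concrete, here is a minimal sketch in Python of the null-byte pattern check. The function name detect_json_encoding is made up for illustration, and the sketch assumes the input really is a BOM-less JSON text whose first two characters are ASCII; it does not validate the JSON itself.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess which BOM-less UTF encoding a JSON text uses.

    Hypothetical helper: assumes the first two characters are ASCII,
    so only the positions of zero bytes in the first 4 bytes matter.
    """
    head = data[:4]
    if len(head) < 4:
        # Two ASCII characters need at least 4 bytes in UTF-16 or UTF-32,
        # so a shorter input can only be UTF-8 (assumption: no NUL bytes).
        if 0 in head:
            raise ValueError("input too short to classify")
        return "utf-8"

    # True marks a zero byte at that position.
    nulls = tuple(b == 0 for b in head)
    patterns = {
        (True, True, True, False): "utf-32-be",    # 00 00 00 xx
        (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",    # xx 00 00 00
        (False, True, False, True): "utf-16-le",   # xx 00 xx 00
        (False, False, False, False): "utf-8",     # xx xx xx xx
    }
    if nulls not in patterns:
        raise ValueError("not a BOM-less UTF-encoded JSON text")
    return patterns[nulls]
```

For example, detect_json_encoding('{"a": 1}'.encode("utf-16-le")) yields "utf-16-le". The same idea does not carry over to arbitrary text, where the first characters are not constrained to ASCII.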