UTF-8 BOM - Haskell-Cafe - Haskell.org

UTF-8 BOM

Tony Morris

5 Jan 2011

1:08 a.m.

I am reading files with System.IO.readFile. Some of these files start with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf). For some functions that process this String, this causes choking, so I drop the BOM as shown below. This feels particularly hacky, but I am not in control of many of these functions (which perhaps could use ByteString with a better solution).

I'm wondering if there is a better way of achieving this goal. Thanks for any tips.

dropBOM :: String -> String
dropBOM [] = []
dropBOM s@(x:xs) =
  let unicodeMarker = '\65279' -- UTF-8 BOM
  in if x == unicodeMarker then xs else s

readBOMFile :: FilePath -> IO String
readBOMFile p = dropBOM `fmap` readFile p

--
Tony Morris
http://tmorris.net/
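The message gives the BOM both as three bytes (0xef 0xbb 0xbf) and as the character '\65279'. The sketch below shows how the two relate: 65279 is U+FEFF, and encoding it with the standard three-byte UTF-8 pattern yields exactly those bytes. The names encodeUtf8ThreeByte and bomBytes are made up for illustration, and the function only covers the three-byte range.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode a code point in the three-byte UTF-8 range
-- (U+0800..U+FFFF). Other ranges and surrogates are not handled here.
encodeUtf8ThreeByte :: Char -> [Word8]
encodeUtf8ThreeByte c =
  [ fromIntegral (0xE0 .|. (n `shiftR` 12))            -- 1110xxxx
  , fromIntegral (0x80 .|. ((n `shiftR` 6) .&. 0x3F))  -- 10xxxxxx
  , fromIntegral (0x80 .|. (n .&. 0x3F))               -- 10xxxxxx
  ]
  where
    n = ord c

-- '\65279' is U+FEFF; its UTF-8 encoding is the byte sequence EF BB BF.
bomBytes :: [Word8]
bomBytes = encodeUtf8ThreeByte '\65279'
```

This is why a String produced by a UTF-8 decoder starts with the single character '\65279' rather than with three separate characters.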


Antoine Latter

5 Jan 2011

1:41 a.m.

On Tue, Jan 4, 2011 at 7:08 PM, Tony Morris wrote:

> I am reading files with System.IO.readFile. Some of these files start with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf). For some functions that process this String, this causes choking so I drop the BOM as shown below. This feels particularly hacky, but I am not in control of many of these functions (that perhaps could use ByteString with a better solution).
>
> I'm wondering if there is a better way of achieving this goal. Thanks for any tips.
>
> dropBOM :: String -> String
> dropBOM [] = []
> dropBOM s@(x:xs) =
>   let unicodeMarker = '\65279' -- UTF-8 BOM
>   in if x == unicodeMarker then xs else s
>
> readBOMFile :: FilePath -> IO String
> readBOMFile p = dropBOM `fmap` readFile p

Are you thinking that the BOM should be automatically stripped from UTF8 text at some low level, if present?

I was thinking about it, and I was deeply conflicted about the idea. Then I read the unicode.org BOM faq[1], and I'm still conflicted.

I'm thinking that it would be correct behavior to drop the BOM from the start of a UTF8 stream, even at a pretty low level. The FAQ seems to allow it as a means of identifying the stream as UTF8 (although it isn't a reliable means of identifying a stream as UTF8).

But I'm no unicode expert.

Antoine

[1] http://unicode.org/faq/utf_bom.html


Mark Lentczner

6 Jan 2011

5:44 a.m.

On Jan 4, 2011, at 5:41 PM, Antoine Latter wrote:

> Are you thinking that the BOM should be automatically stripped from UTF8 text at some low level, if present?

It should not. Whether or not a U+FEFF can be stripped depends on the context in which it is found. There is no way that lower level code, even file primitives, can know this context.

> I'm thinking that it would be correct behavior to drop the BOM from the start of a UTF8 stream, even at a pretty low level. The FAQ seems to allow it as a means of identifying the stream as UTF8 (although it isn't a reliable means of identifying a stream as UTF8).

§3.9 and §3.10 of the Unicode standard go into more depth on the issue and make things clearer. A leading U+FEFF is considered "not part of the text", and dropped, only when the encoding is UTF-16 or UTF-32. In all other cases (including the -BE and -LE variants of UTF-16 and UTF-32) the U+FEFF character is retained.

The FAQ states that a leading byte sequence of EF BB BF in a stream indicates that the stream is UTF-8, though it doesn't go so far as to say that it can be stripped. Since Unicode doesn't want to encourage the use of BOM in UTF-8 (see end of §3.10), I imagine they don't want to promulgate it as a useful encoding indicator.

So, it might be reasonable that when opening a file in UTF-16 mode (not UTF-16BE or UTF-16LE), the system should read the initial bytes, determine the byte order, and remove the BOM if present[1]. But it isn't safe or correct to do this for UTF-8.

On Jan 4, 2011, at 5:08 PM, Tony Morris wrote:

> I am reading files with System.IO.readFile. Some of these files start with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf). For some functions that process this String, this causes choking so I drop the BOM as shown below.

If you mean functions in the standard libs, they shouldn't have any problems with the BOM character. If they do, these are bugs.

On the other hand, if you know the context of the files, and know for certain that the leading BOM is intended only as an encoding indicator, then by all means strip it off. But only you can know if this is true for your application; the system cannot.

If so, your code doesn't look hackish to me at all. I'd only perhaps tidy up dropBOM a bit (but this is a purely stylistic choice):

readBomFile :: FilePath -> IO String
readBomFile p = dropBom `fmap` readFile p
  where
    dropBom ('\xfeff':s) = s -- U+FEFF at the start is a BOM
    dropBom s = s

I'd keep dropBom private to readBomFile to ensure that it isn't used on arbitrary strings, since it is really only valid at the start of an encoded stream.

- Mark

[1] Software reading a single text stream that has been split across files would have a problem here. But this is perhaps an obscure and unlikely case.


Simon Marlow

7 Jan 2011

12:57 p.m.

On 06/01/2011 05:44, Mark Lentczner wrote:

> On Jan 4, 2011, at 5:41 PM, Antoine Latter wrote:
>
>> Are you thinking that the BOM should be automatically stripped from UTF8 text at some low level, if present?
>
> It should not. Whether or not a U+FEFF can be stripped depends on the context in which it is found. There is no way that lower level code, even file primitives, can know this context.
>
>> I'm thinking that it would be correct behavior to drop the BOM from the start of a UTF8 stream, even at a pretty low level. The FAQ seems to allow it as a means of identifying the stream as UTF8 (although it isn't a reliable means of identifying a stream as UTF8).
>
> §3.9 and §3.10 of the Unicode standard go into more depth on the issue and make things clearer. A leading U+FEFF is considered "not part of the text", and dropped, only when the encoding is UTF-16 or UTF-32. In all other cases (including the -BE and -LE variants of UTF-16 and UTF-32) the U+FEFF character is retained.
>
> The FAQ states that a leading byte sequence of EF BB BF in a stream indicates that the stream is UTF-8, though it doesn't go so far as to say that it can be stripped. Since Unicode doesn't want to encourage the use of BOM in UTF-8 (see end of §3.10), I imagine they don't want to promulgate it as a useful encoding indicator.
>
> So, it might be reasonable that when opening a file in UTF-16 mode (not UTF-16BE or UTF-16LE), the system should read the initial bytes, determine the byte order, and remove the BOM if present[1]. But it isn't safe or correct to do this for UTF-8.

This is exactly what the built-in System.IO.utf16 codec does. There's also a utf8_bom which behaves like UTF8 except that it strips an optional leading BOM when reading and emits a BOM when writing.

Cheers,
Simon
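To make the difference between the two codecs concrete, here is a small sketch that writes a string with one encoding and reads it back with another. The helper name writeThenRead and the file path passed to it are made up for illustration; utf8, utf8_bom, hSetEncoding and the other handle functions come from System.IO.

```haskell
import System.IO

-- Hypothetical helper: write a string using one encoding, then read the
-- same file back using another, returning what the reader sees.
writeThenRead :: TextEncoding -> TextEncoding -> FilePath -> String -> IO String
writeThenRead writeEnc readEnc path s = do
  withFile path WriteMode $ \h -> do
    hSetEncoding h writeEnc
    hPutStr h s
  h <- openFile path ReadMode
  hSetEncoding h readEnc
  contents <- hGetContents h
  length contents `seq` hClose h  -- force the lazy read before closing
  return contents
```

Writing with utf8_bom and reading with plain utf8 should return the text with a leading '\xFEFF', because the plain codec treats the BOM as an ordinary character. Reading the same file with utf8_bom should return the text without it.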


Albert Y. C. Lai

5 Jan 2011

2:55 a.m.

On 11-01-04 08:08 PM, Tony Morris wrote:

> I am reading files with System.IO.readFile. Some of these files start with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf).

There is System.IO.utf8_bom for that. Of course, then you can't use readFile; but you can use openFile and hGetContents.
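A minimal sketch of that suggestion, as a drop-in replacement for readFile. The name readFileUtf8Bom is made up for illustration; openFile, hSetEncoding, utf8_bom and hGetContents are the System.IO functions mentioned in the thread.

```haskell
import System.IO

-- Hypothetical readFile replacement: decode as UTF-8 and skip an optional
-- leading BOM. Like readFile, the contents are read lazily, and the handle
-- is closed once the whole string has been consumed.
readFileUtf8Bom :: FilePath -> IO String
readFileUtf8Bom path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8_bom
  hGetContents h
```

Note that this fixes the encoding to UTF-8 regardless of the locale, whereas readFile uses the locale's default encoding.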


Gregory Collins

6:22 p.m.

Use the text library instead?

On Jan 5, 2011 2:09 AM, "Tony Morris" wrote:

> I am reading files with System.IO.readFile. Some of these files start with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf). For some functions that process this String, this causes choking so I drop the BOM as shown below. This feels particularly hacky, but I am not in control of many of these functions (that perhaps could use ByteString with a better solution).
>
> I'm wondering if there is a better way of achieving this goal. Thanks for any tips.
>
> dropBOM :: String -> String
> dropBOM [] = []
> dropBOM s@(x:xs) =
>   let unicodeMarker = '\65279' -- UTF-8 BOM
>   in if x == unicodeMarker then xs else s
>
> readBOMFile :: FilePath -> IO String
> readBOMFile p = dropBOM `fmap` readFile p
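The reply does not say how the text library would be used, so the following is only one possible reading: read the file as raw bytes, drop a leading EF BB BF if present, and decode the rest as UTF-8. It assumes the bytestring and text packages are available; the names utf8Bom, decodeDroppingBom and readTextDroppingBom are made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The UTF-8 encoding of U+FEFF.
utf8Bom :: B.ByteString
utf8Bom = B.pack [0xEF, 0xBB, 0xBF]

-- Hypothetical helper: drop a leading BOM at the byte level, then decode.
-- decodeUtf8 throws an exception on invalid UTF-8 input.
decodeDroppingBom :: B.ByteString -> T.Text
decodeDroppingBom bytes
  | utf8Bom `B.isPrefixOf` bytes = TE.decodeUtf8 (B.drop (B.length utf8Bom) bytes)
  | otherwise                    = TE.decodeUtf8 bytes

-- Hypothetical helper: strict file read followed by BOM-aware decoding.
readTextDroppingBom :: FilePath -> IO T.Text
readTextDroppingBom path = decodeDroppingBom <$> B.readFile path
```

As noted earlier in the thread, stripping is only appropriate when the leading BOM is known to be an encoding indicator rather than part of the text.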
