Re: [Haskell-cafe] UTF-8 BOM

5 Jan 2011

On Tue, Jan 4, 2011 at 7:08 PM, Tony Morris wrote:
...
I am reading files with System.IO.readFile. Some of these files start
with a UTF-8 byte order mark (0xef 0xbb 0xbf). Some of the functions that
process the resulting String choke on it, so I drop the BOM as shown
below. This feels particularly hacky, but I am not in control of many of
these functions (which could perhaps use ByteString and a better solution).
I'm wondering if there is a better way of achieving this goal. Thanks
for any tips.
dropBOM ::
 String
 -> String
dropBOM [] =
 []
dropBOM s@(x:xs) =
 let unicodeMarker = '\65279' -- U+FEFF, the BOM character after UTF-8 decoding
 in if x == unicodeMarker then xs else s
readBOMFile ::
 FilePath
 -> IO String
readBOMFile p =
 dropBOM `fmap` readFile p
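
For comparison, the same check can be written more compactly with
stripPrefix from Data.List. This is only a sketch: the names bom and
dropBOM' are made up for illustration and do not come from the thread. It
assumes, as the code above does, that the file was decoded as UTF-8, so
the three BOM bytes arrive as the single character U+FEFF (decimal 65279).

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

-- U+FEFF (decimal 65279) is what the bytes 0xEF 0xBB 0xBF decode to
-- under UTF-8. The name `bom` is hypothetical.
bom :: String
bom = "\xFEFF"

-- Remove a single leading BOM character if present; otherwise return
-- the input unchanged. Hypothetical name, same behaviour as dropBOM.
dropBOM' :: String -> String
dropBOM' s = fromMaybe s (stripPrefix bom s)
```

Like dropBOM, this only removes the character when it is the very first
one in the String, and leaves every other String untouched.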
Are you thinking that the BOM should be automatically stripped from
UTF-8 text at some low level, if present?

I was thinking about it, and I was deeply conflicted about the idea.
Then I read the unicode.org BOM FAQ [1], and I'm still conflicted.

I'm thinking that it would be correct behavior to drop the BOM from
the start of a UTF-8 stream, even at a pretty low level. The FAQ seems
to allow it as a means of identifying the stream as UTF-8 (although it
isn't a reliable means of identifying a stream as UTF-8).
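
As one illustration of stripping at a lower level: System.IO in base
exports a utf8_bom encoding which, according to its documentation, behaves
like utf8 except that a BOM at the start of the input is ignored. A
minimal sketch, assuming a base version that provides utf8_bom and
hSetEncoding; the helper name readUtf8SkippingBOM is hypothetical:

```haskell
import System.IO

-- Hypothetical helper (not from the thread): read a file as UTF-8 and
-- let the handle's encoding skip a leading BOM if one is present.
readUtf8SkippingBOM :: FilePath -> IO String
readUtf8SkippingBOM path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8_bom  -- like utf8, but ignores a BOM at stream start
  hGetContents h           -- lazy read; handle closes once input is consumed
```

With this approach the String handed to downstream functions never
contains U+FEFF at the front, so no separate dropBOM step is needed. Note
that utf8_bom also prepends a BOM when used on an output handle.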

But I'm no Unicode expert.

Antoine

[1] http://unicode.org/faq/utf_bom.html
...
--
Tony Morris
http://tmorris.net/

Antoine Latter