
On Tue, Jan 4, 2011 at 7:08 PM, Tony Morris wrote:
I am reading files with System.IO.readFile. Some of these files start with a UTF-8 byte order mark (0xEF 0xBB 0xBF). Some of the functions that process the resulting String choke on it, so I drop the BOM as shown below. This feels particularly hacky, but I do not control many of these functions (which could perhaps use ByteString for a better solution).
I'm wondering if there is a better way of achieving this goal. Thanks for any tips.
dropBOM :: String -> String
dropBOM [] = []
dropBOM s@(x:xs) =
  let unicodeMarker = '\65279' -- UTF-8 BOM
  in if x == unicodeMarker then xs else s

readBOMFile :: FilePath -> IO String
readBOMFile p = dropBOM `fmap` readFile p
Are you thinking that the BOM should be automatically stripped from UTF8 text at some low level, if present?

I was thinking about it, and I was deeply conflicted about the idea. Then I read the unicode.org BOM faq[1], and I'm still conflicted.

I'm thinking that it would be correct behavior to drop the BOM from the start of a UTF8 stream, even at a pretty low level. The FAQ seems to allow it as a means of identifying the stream as UTF8 (although it isn't a reliable means of identifying a stream as UTF8).

But I'm no unicode expert.

Antoine

[1] http://unicode.org/faq/utf_bom.html
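
For what it's worth, stripping at the decoding level can be sketched with the utf8_bom encoding that GHC's System.IO exports: on input it behaves like utf8 but skips a BOM at the very start of the stream. The helper name readFileSkippingBOM below is made up for illustration and is not from the original post; this is a minimal sketch that assumes the files really are UTF-8, not a drop-in replacement for readBOMFile.

```haskell
import System.IO

-- Hypothetical helper (name is an assumption, not from the thread):
-- read a whole file as UTF-8 and let the decoder skip a leading BOM
-- if one is present.
readFileSkippingBOM :: FilePath -> IO String
readFileSkippingBOM path = do
  h <- openFile path ReadMode
  -- utf8_bom decodes like utf8 but ignores a BOM at the start of input
  hSetEncoding h utf8_bom
  contents <- hGetContents h
  -- hGetContents is lazy, so force the whole string before closing
  length contents `seq` hClose h
  return contents
```

A caller would use it the same way as readBOMFile, for example `readFileSkippingBOM "input.txt" >>= putStr`. One difference in design: this pins the encoding to UTF-8 explicitly, whereas readFile decodes with whatever the locale encoding happens to be.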
-- Tony Morris http://tmorris.net/
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe