
On Fri, Aug 31, 2012 at 7:59 AM, jeff p wrote:
Hello,
I have a sample file (attached) which I cannot read into Text:
Prelude Control.Applicative> Data.Text.IO.readFile "foo"
*** Exception: utf8.txt: hGetContents: invalid argument (invalid byte sequence)
Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$> Data.ByteString.Char8.readFile "foo"
"*** Exception: Cannot decode byte '\x6e': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
So it seems that foo doesn't contain valid UTF-8. However, System.IO.UTF8 has no problem reading the data:
Prelude Control.Applicative> System.IO.UTF8.readFile "foo"
"3591,,,dihigma99h,1905,5,25,CUBA,,Matanzas,1971,5,20,CUBA,,Cienfuegos,Martin,Dihigo,,Mart\65533n Magdaleno Dihigo (Llanos),,190,74,R,R,,,,dihigma99,dihigma99,dihim001,dihigma99,dihigma99\r\n"
Shouldn't these all have the same behavior?
\65533 is the Unicode replacement character, U+FFFD. Its presence means
that the source text is not valid UTF-8: the parser in System.IO.UTF8 is
silently replacing the bad bytes, while the other two throw an exception.
If you want the same behaviour from the Text decoder, use
Data.Text.Encoding.decodeUtf8With with a lenient error handler such as
Data.Text.Encoding.Error.lenientDecode. It's likely, however, that your
input text is in some other encoding, like ISO-8859-1. Use the text-icu
package (http://hackage.haskell.org/package/text-icu) to decode it.
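
A minimal sketch of both suggestions, assuming a reasonably recent text
package. The helper names (decodeLenient, readFileLenient, decodeAsLatin1)
are made up for illustration. For the ISO-8859-1 case it uses decodeLatin1
from the text package itself rather than text-icu, which only works if the
file really is Latin-1; other encodings would still need text-icu.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: decode UTF-8, substituting U+FFFD for input
-- that cannot be decoded instead of throwing an exception.
decodeLenient :: B.ByteString -> T.Text
decodeLenient = decodeUtf8With lenientDecode

-- Hypothetical helper: read the raw bytes first, then decode leniently,
-- so no exception is raised on an invalid byte sequence.
readFileLenient :: FilePath -> IO T.Text
readFileLenient path = decodeLenient <$> B.readFile path

-- Hypothetical helper: ASSUMES the bytes are ISO-8859-1 (Latin-1),
-- in which case every byte maps directly to one code point.
decodeAsLatin1 :: B.ByteString -> T.Text
decodeAsLatin1 = decodeLatin1
```

In GHCi, readFileLenient "foo" should then return a Text value rather than
throwing. If the name comes out as "Martín" with decodeAsLatin1, that would
suggest the file is Latin-1 and not UTF-8.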
G
--
Gregory Collins