On Fri, Aug 31, 2012 at 7:59 AM, jeff p <mutjida@gmail.com> wrote:
Hello,

I have a sample file (attached) which I cannot read into Text:

    Prelude Control.Applicative> Data.Text.IO.readFile "foo"
    *** Exception: utf8.txt: hGetContents: invalid argument (invalid byte sequence)

    Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$> Data.ByteString.Char8.readFile "foo"
    "*** Exception: Cannot decode byte '\x6e': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream

So it seems that foo doesn't contain valid UTF-8. However,
System.IO.UTF8 has no problem reading the data:

    Prelude Control.Applicative> System.IO.UTF8.readFile "foo"
    "3591,,,dihigma99h,1905,5,25,CUBA,,Matanzas,1971,5,20,CUBA,,Cienfuegos,Martin,Dihigo,,Mart\65533n Magdaleno Dihigo (Llanos),,190,74,R,R,,,,dihigma99,dihigma99,dihim001,dihigma99,dihigma99\r\n"

Shouldn't these all have the same behavior?

\65533 is the Unicode replacement character, U+FFFD. Its presence means that the source text is not valid UTF-8: the parser in System.IO.UTF8 silently replaces the bad bytes, while the other two throw an exception. If you want the same behaviour from the Text decoder, use Data.Text.Encoding.decodeUtf8With, which takes an error handler; passing lenientDecode from Data.Text.Encoding.Error substitutes U+FFFD for invalid input instead of throwing.

It's likely, however, that your input text is in some other encoding, such as ISO-8859-1. If so, use the text-icu package (http://hackage.haskell.org/package/text-icu) to decode it.
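
A minimal sketch of the lenient decoding described above, assuming a reasonably recent version of the text package. The helper names (decodeLenient, readFileLenient, readFileLatin1) are made up for illustration. The last helper uses Data.Text.Encoding.decodeLatin1 rather than text-icu, and only makes sense if the guess that the file is ISO-8859-1 turns out to be right:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Decode as UTF-8, substituting U+FFFD for invalid input
-- instead of throwing an exception.
decodeLenient :: B.ByteString -> T.Text
decodeLenient = decodeUtf8With lenientDecode

-- Hypothetical helper: read the raw bytes strictly, then decode leniently.
readFileLenient :: FilePath -> IO T.Text
readFileLenient path = decodeLenient <$> B.readFile path

-- Hypothetical helper, ASSUMING the file is really ISO-8859-1 (Latin-1).
-- Every byte maps to a code point, so this never fails, but it yields
-- wrong characters if the assumption about the encoding is wrong.
readFileLatin1 :: FilePath -> IO T.Text
readFileLatin1 path = decodeLatin1 <$> B.readFile path
```

In GHCi, readFileLenient "foo" should then show the replacement character where System.IO.UTF8 did, while readFileLatin1 "foo" would show an accented character in its place only if the Latin-1 guess is correct.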

G
--
Gregory Collins <greg@gregorycollins.net>