Data.Text UTF-8 question
 
Hello,

I have a sample file (attached) which I cannot read into Text:

Prelude Control.Applicative> Data.Text.IO.readFile "foo"
*** Exception: utf8.txt: hGetContents: invalid argument (invalid byte sequence)
Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$> Data.ByteString.Char8.readFile "foo"
"*** Exception: Cannot decode byte '\x6e': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream

So it seems that foo doesn't contain valid UTF-8. However, System.IO.UTF8 has no problem reading the data:

Prelude Control.Applicative> System.IO.UTF8.readFile "foo"
"3591,,,dihigma99h,1905,5,25,CUBA,,Matanzas,1971,5,20,CUBA,,Cienfuegos,Martin,Dihigo,,Mart\65533n Magdaleno Dihigo (Llanos),,190,74,R,R,,,,dihigma99,dihigma99,dihim001,dihigma99,dihigma99\r\n"

Shouldn't these all have the same behavior?

I am running on Mac OS X 10.8.1, with GHC 7.4.2 and text-0.11.2.3.

thanks for any insight,
Jeff
 
On Fri, Aug 31, 2012 at 7:59 AM, jeff p wrote:
Hello,
I have a sample file (attached) which I cannot read into Text:
Prelude Control.Applicative> Data.Text.IO.readFile "foo"
*** Exception: utf8.txt: hGetContents: invalid argument (invalid byte sequence)
Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$> Data.ByteString.Char8.readFile "foo"
"*** Exception: Cannot decode byte '\x6e': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
So it seems that foo doesn't contain valid UTF-8. However, System.IO.UTF8 has no problem reading the data:
Prelude Control.Applicative> System.IO.UTF8.readFile "foo"
"3591,,,dihigma99h,1905,5,25,CUBA,,Matanzas,1971,5,20,CUBA,,Cienfuegos,Martin,Dihigo,,Mart\65533n Magdaleno Dihigo (Llanos),,190,74,R,R,,,,dihigma99,dihigma99,dihim001,dihigma99,dihigma99\r\n"
Shouldn't these all have the same behavior?
\65533 is the Unicode replacement character, U+FFFD. This means that the
source text is not valid UTF-8: the parser in System.IO.UTF8 is silently
replacing the bad bytes, while the others are throwing an exception. If
you want the same behaviour with the Text parser, use
Data.Text.Encoding.decodeUtf8With, which takes an error handler and so
lets you replicate this. It's likely, however, that your input text is in
some other encoding, such as ISO-8859-1. Use the text-icu package (
http://hackage.haskell.org/package/text-icu) to decode such input.
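
A minimal sketch of the lenient decoding described above. The choice of lenientDecode (from Data.Text.Encoding.Error) as the handler is an assumption, as are the names sampleBytes and lenient; the sample bytes are a hypothetical stand-in for the attached file, spelling "Mart", then a lone 0xED byte, then "n".

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical sample: ASCII "Mart", a stray 0xED byte, then ASCII "n".
-- 0xED opens a multi-byte UTF-8 sequence, but 0x6E is not a continuation byte,
-- so this is not valid UTF-8.
sampleBytes :: B.ByteString
sampleBytes = B.pack [0x4D, 0x61, 0x72, 0x74, 0xED, 0x6E]

-- lenientDecode substitutes U+FFFD for input it cannot decode
-- instead of throwing an exception.
lenient :: T.Text
lenient = decodeUtf8With lenientDecode sampleBytes

main :: IO ()
main = print lenient
```

The plain decodeUtf8 would throw on the same input, so the handler is the only difference between aborting and substituting.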
G
-- 
Gregory Collins 
 
            On 12-08-31 01:59 AM, jeff p wrote:
I have a sample file (attached) which I cannot read into Text:
Prelude Control.Applicative> Data.Text.IO.readFile "foo"
*** Exception: utf8.txt: hGetContents: invalid argument (invalid byte sequence)
Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$> Data.ByteString.Char8.readFile "foo"
"*** Exception: Cannot decode byte '\x6e': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
At offsets from 0x55 to 0x5A:

0x4D 0x61 0x72 0x74 0xED 0x6E

This is clearly not UTF-8. This would be, in ISO-8859-1, "Martín".

"Martín" in UTF-8 is 0x4D 0x61 0x72 0x74 0xC3 0xAD 0x6E, and it takes one more byte.

And like Gregory Collins says, different UTF-8 decoders may handle errors differently. Some abort. Some others fill in a special character.
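
The byte comparison above can be checked with a short sketch. It assumes a version of the text package that exports decodeLatin1 from Data.Text.Encoding (text-icu would be the more general route); the names fileBytes, asLatin1 and asUtf8Bytes are illustrative only.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, encodeUtf8)
import Data.Word (Word8)

-- The six bytes quoted from the file at offsets 0x55 to 0x5A.
fileBytes :: B.ByteString
fileBytes = B.pack [0x4D, 0x61, 0x72, 0x74, 0xED, 0x6E]

-- Read as ISO-8859-1, every byte maps to the code point of the same value,
-- so 0xED becomes U+00ED (i with acute accent).
asLatin1 :: T.Text
asLatin1 = decodeLatin1 fileBytes

-- Re-encoding the same text as UTF-8 turns U+00ED into the two bytes 0xC3 0xAD.
asUtf8Bytes :: [Word8]
asUtf8Bytes = B.unpack (encodeUtf8 asLatin1)

main :: IO ()
main = do
  print (T.unpack asLatin1)     -- six characters
  print (length asUtf8Bytes)    -- seven bytes
  print asUtf8Bytes
```

The Latin-1 reading has six characters in six bytes, while the UTF-8 encoding of the same text needs seven bytes.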
participants (3)
- Albert Y. C. Lai
- Gregory Collins
- jeff p