
Hi all,

I've got a trivial test program:

    main :: IO ()
    main = do
        text <- readFile "unicode.txt"
        putStr text

which I compile with ghc-6.12.1 (from Debian) and when it runs I get:

    hGetContents: invalid argument (Invalid or incomplete multibyte or wide character)

I've done some googling which seems to suggest that I need to set the LANG environment variable, but I already have that set to en_AU.UTF-8.

Clues?

Cheers,
Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/

On Sun, Nov 28, 2010 at 8:26 AM, Erik de Castro Lopo wrote:
hGetContents: invalid argument (Invalid or incomplete multibyte or wide character)
I've done some googling which seems to suggest that I need to set the LANG environment variable, but I already have that set to en_AU.UTF-8.
Perhaps a silly question, but are you certain that the input file is valid UTF-8?

You could also try using the readFile from utf8-string[1], which I believe ignores improper UTF-8 sequences. A theoretically better approach is to read the contents as a lazy bytestring and then use the decode functions from the text package, but that's a little bit more work.

[1] http://hackage.haskell.org/packages/archive/utf8-string/0.3.6/doc/html/Syste...
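For readers following along, the bytestring-then-decode approach mentioned above might look roughly like the sketch below. It assumes the bytestring and text packages are installed; the helper name decodeLenient is made up for illustration, and the choice of lenientDecode (which substitutes U+FFFD for each undecodable byte) is only one of several error-handling policies the text package offers.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import qualified Data.Text.Lazy.IO as TLIO
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: decode bytes as UTF-8, replacing each invalid
-- byte with U+FFFD instead of throwing an exception.
decodeLenient :: BL.ByteString -> TL.Text
decodeLenient = TLE.decodeUtf8With lenientDecode

main :: IO ()
main = do
    -- Raw bytes: no handle encoding is involved in the read.
    bytes <- BL.readFile "unicode.txt"
    -- Output still goes through stdout's own encoding.
    TLIO.putStr (decodeLenient bytes)
```

Because the file is read as raw bytes, the result of the read does not depend on LANG; only the final write to stdout does.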

Michael Snoyman wrote:
Perhaps a silly question, but are you certain that the input file is valid UTF-8?
That is a very good point.
You could also try using the readFile from utf8-string... [or] read the contents as a lazy bytestring and then use the decode functions...
Those approaches are now both deprecated. Either do what you are doing, which gives you conceptually simple strings as lists of Char. Or, for better efficiency, use the text package:
    -- readFile and putStr for lazy Text live in Data.Text.Lazy.IO
    import qualified Data.Text.Lazy.IO as T

    main :: IO ()
    main = do
        text <- T.readFile "unicode.txt"
        T.putStr text

In any case, you still need to have the correct encoding set on the handles as before. (And the input needs to be valid for your selected encoding.)

Regards,
Yitz

On Sun, Nov 28, 2010 at 8:53 AM, Yitzchak Gale wrote:
In any case, you still need to have the correct encoding set on the handles as before. (And the input needs to be valid for your selected encoding.)
Which is why I would actually recommend sticking with the bytestring/text combination when you know what the file encoding will be and it is not system-dependent. It's the approach that I use with Hamlet et al for precisely that reason. Michael

On Sun, Nov 28, 2010 at 9:19 AM, Michael Snoyman wrote:
Which is why I would actually recommend sticking with the bytestring/text combination when you know what the file encoding will be and it is not system-dependent. It's the approach that I use with Hamlet et al for precisely that reason.
Sorry for replying to myself, but I didn't clarify that very well. You're right that setting encoding on the handle can work well enough for this, but it does *not* address invalid byte sequences (AFAIK), which can be dealt with using the bytestring/text decoding combination. Michael
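One way to act on both points (checking whether the file really is valid UTF-8, and handling invalid byte sequences explicitly) is to decode with decodeUtf8' from the text package, which returns an Either instead of throwing. This is only a sketch under the assumption that the whole file fits in memory as a strict ByteString; validUtf8 is a made-up helper name.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text.IO as TIO
import Data.Text.Encoding (decodeUtf8')
import System.IO (hPutStrLn, stderr)

-- Hypothetical helper: True when the bytes form well-formed UTF-8.
validUtf8 :: B.ByteString -> Bool
validUtf8 = either (const False) (const True) . decodeUtf8'

main :: IO ()
main = do
    bytes <- B.readFile "unicode.txt"
    case decodeUtf8' bytes of
        -- The decoding error arrives as an ordinary value.
        Left err   -> hPutStrLn stderr ("not valid UTF-8: " ++ show err)
        Right text -> TIO.putStr text
```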

I wrote:
In any case, you still need to have the correct encoding set on the handles as before.
Michael Snoyman wrote:
...it does *not* address invalid byte sequences (AFAIK), which can be dealt with using the bytestring/text decoding combination.
Well, using the standard interface, you have three choices on how to handle invalid byte sequences - drop them, use a replacement character, or throw an exception, with the third choice being the default. You specify that choice when you set the encoding. See the documentation for System.IO for more details.

However, those choices are implemented via GNU iconv, so on Windows you only have the default behavior.

Also, in certain special situations - like if you need to be able to specify the replacement character yourself, or if you need in-band exceptions (e.g. a stream of Either error character), then the options do seem limited currently.

You might still need to fall back on the old bytestring hack in those cases. If you find yourself in that situation, it might be a good idea to push the maintainers of System.IO and Data.Text to continue to improve support for encodings in the standard libraries.

Regards,
Yitz
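As an illustration of choosing the policy when setting the encoding, the sketch below uses mkTextEncoding from System.IO with an iconv-style "//IGNORE" suffix to ask for undecodable bytes to be dropped. Treat it as an assumption-laden example: readIgnoringInvalid is a hypothetical name, and whether the suffix is honoured depends on the GHC version and platform (as noted above, on ghc-6.12 it relies on iconv).

```haskell
import System.IO

-- Hypothetical helper: read a file as UTF-8, asking the runtime to
-- skip bytes that cannot be decoded rather than raise an exception.
readIgnoringInvalid :: FilePath -> IO String
readIgnoringInvalid path = do
    enc <- mkTextEncoding "UTF-8//IGNORE"
    h <- openFile path ReadMode
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the lazy read to finish before closing the handle.
    length s `seq` hClose h
    return s

main :: IO ()
main = readIgnoringInvalid "unicode.txt" >>= putStr
```

Swapping the suffix for "//TRANSLIT" requests a replacement character instead; leaving the suffix off keeps the default of throwing an exception.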

On Sun, Nov 28, 2010 at 10:35 AM, Yitzchak Gale wrote:
Well, using the standard interface, you have three choices on how to handle invalid byte sequences - drop them, use a replacement character, or throw an exception, with the third choice being the default. You specify that choice when you set the encoding. See the documentation for System.IO for more details.
I hadn't realized that the standard libraries offered so much sophistication in their approach to file encodings. I'll have to look at it more thoroughly.

Michael

Erik de Castro Lopo wrote:
hGetContents: invalid argument (Invalid or incomplete multibyte or wide character) I've done some googling which seems to suggest that I need to set the LANG environment variable, but I already have that set to en_AU.UTF-8.
You can check to see what encoding GHC has picked up from your environment by examining localeEncoding.

You can force the encoding to UTF-8 by

    hSetEncoding stdin utf8
    hSetEncoding stdout utf8

All of the above in the context of import System.IO, of course.

Regards,
Yitz

Yitzchak Gale wrote:
You can check to see what encoding GHC has picked up from your environment by examining localeEncoding.
How do I do that? TextEncoding doesn't seem to be Showable.
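A note for later readers: in newer versions of base (4.3 and up, as far as I know) TextEncoding does have a Show instance that yields the encoding's name, so something like the sketch below should work there. It would not compile against the base shipped with ghc-6.12.1, which is the situation described in the question.

```haskell
import System.IO

main :: IO ()
main = do
    -- Encoding GHC derived from the environment (LANG, LC_ALL, ...).
    putStrLn ("locale encoding: " ++ show localeEncoding)
    -- Encoding currently attached to stdout; Nothing means binary mode.
    enc <- hGetEncoding stdout
    putStrLn ("stdout encoding: " ++ maybe "binary" show enc)
```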
You can force the encoding to UTF-8 by
    hSetEncoding stdin utf8
    hSetEncoding stdout utf8
All of the above in the context of import System.IO, of course.
Thank you. My program:

    import System.IO

    main :: IO ()
    main = do
        h <- openFile "unicode.txt" ReadMode
        hSetEncoding h utf8        -- decode the input file as UTF-8
        hSetEncoding stdout utf8   -- encode the output as UTF-8
        text <- hGetContents h
        putStr text

now works as it should.

Cheers,
Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
participants (3)
- Erik de Castro Lopo
- Michael Snoyman
- Yitzchak Gale