Re: [Haskell-cafe] Unicode pretty-printing

Thanks for getting back to me. I was imprecise: by "UTF8 characters" I
mean Unicode. My source files are UTF8-encoded and Haskell reads them
fine; it only has problems outputting them in a readable way. At this
point I'm not talking about any I/O besides plain console output.
Not using Show is not much of a choice, since I'm using HUnit, which
uses Show and prints the test results via the standard output
functions. I've tried to wrap my strings and redefine Show so that it
doesn't escape anything, but the standard output functions don't
accept that, and HUnit doesn't know anything about System.IO.UTF8:
----
import System.IO.UTF8
import qualified System.IO
import Test.HUnit

newtype UString = UString String

instance Show UString where
  show (UString s) = s

instance Eq UString where
  (==) (UString s1) (UString s2) = s1 == s2

test1 = TestCase (assertEqual "fail" (UString "абв") (UString "где"))

main =
  System.IO.hSetBinaryMode System.IO.stdout True >>
  System.IO.UTF8.putStrLn "это тест"
---------
Prelude> :load utest.hs
[1 of 1] Compiling Main ( utest.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
это тест
*Main> runTestTT test1
### Failure:
fail
expected: *** Exception: <stderr>: hPutChar: invalid argument (Illegal
byte sequence)
---------
I've tried replacing UString X in the test with Data.Text.pack X and
even desperately with Data.Text.Encoding.encodeUtf8 (Data.Text.pack
X), but no dice. Though this time instead of crashes I get the good
old escapes.
On 29 August 2010 00:09, Yitzchak Gale wrote:
Peter Gromov wrote:
> Unfortunately, Haskell escapes UTF8 characters.
What do you mean by "UTF8 characters"?
Each element of the Char type represents a single Unicode character, not encoded in UTF-8 or any other encoding.
When you read a text file using the traditional IO functions, recent versions of GHC will use the encoding of the "current locale" (whatever that means on your system) to decode the input into Unicode, unless you specify otherwise. The same is true for writing to the console or to a file.
As Don pointed out, you may be interested in using the newer Data.Text instead, especially when encodings matter to you. It will usually be faster than traditional IO, and it is designed to be the new standard for representing text in Haskell.
A third option would be to read the data as raw binary bytes, without any decoding, using Data.ByteString. Then it is totally up to you to do any decoding or encoding.
In any case, the standard Show instances will not be able to do a very good job of displaying non-ASCII characters; Show cannot make very many assumptions about your data or your environment. As Don suggested, you may want to define your own type class similar to Show that does what you want.
Regards, Yitz
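
A minimal sketch of the "type class similar to Show" idea mentioned above. The names Display, Plain and printDisplay are hypothetical and do not come from any library; the point is only that such a class can return text verbatim, where show would escape it. Note that this does not help with HUnit by itself, since HUnit only knows about Show.

```haskell
-- Hypothetical Show-like class: renders a value for humans, without escaping.
class Display a where
  display :: a -> String

-- Wrapper for text that should be printed exactly as written.
-- (A newtype avoids needing an instance for String itself.)
newtype Plain = Plain String

instance Display Plain where
  display (Plain s) = s

-- For types where show is already readable, just reuse it.
instance Display Int where
  display = show

-- Counterpart of print: writes the Display rendering plus a newline.
printDisplay :: Display a => a -> IO ()
printDisplay = putStrLn . display
```

With this in scope, printDisplay (Plain "абв") hands the three Cyrillic characters to putStrLn unchanged; whether they then appear correctly still depends on the encoding of the output handle.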

2010/8/29 Peter Gromov:
> Thanks for getting back to me. I was imprecise: by "UTF8 characters" I
> mean Unicode. My source files are UTF8-encoded and Haskell reads them
> fine; it only has problems outputting them in a readable way. At this
> point I'm not talking about any I/O besides plain console output.
How are you outputting them? Unless you write the String with a textual I/O function such as putStrLn under GHC 6.12, showing a String (which is what print does) will escape every non-ASCII character.
> Prelude> :load utest.hs
> [1 of 1] Compiling Main ( utest.hs, interpreted )
> Ok, modules loaded: Main.
> *Main> main
> это тест
> *Main> runTestTT test1
> ### Failure:
> fail
> expected: *** Exception: <stderr>: hPutChar: invalid argument (Illegal
> byte sequence)
Hmmm... if you are using GHC 6.12, what's your locale? I've only seen error messages like that when using something with a different encoding than your locale.
-- 
Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com
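
The exception above names <stderr>, while the test program only adjusted stdout. One way to take the locale out of the picture is to set the encoding on both console handles explicitly. This is a sketch under the assumption that the terminal really accepts UTF-8; forceUtf8 is a hypothetical helper name, while hSetEncoding and utf8 are the System.IO functions that arrived with GHC 6.12.

```haskell
import System.IO

-- Assumption: the terminal accepts UTF-8 regardless of what the locale says.
-- Sets the text encoding of both console handles, overriding the
-- locale-derived default that GHC picked at startup.
forceUtf8 :: IO ()
forceUtf8 = mapM_ (\h -> hSetEncoding h utf8) [stdout, stderr]
```

In GHCi this could be tried as forceUtf8 >> runTestTT test1. It should replace the "invalid argument" failure with readable output, provided nothing else has put the handles into binary mode.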

Yes, it's GHC 6.12.3. Character escapes are the least I want to see.
As for the locale:
$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Unfortunately, I'm completely lame with locales and googling doesn't
help very much to change "C" to something like UTF-8 (Mac, Snow
Leopard). But the terminal does support UTF-8 anyway, as the output
from my previous message shows (the main function output).

On 29 August 2010 21:24, Peter Gromov wrote:
> Yes, it's GHC 6.12.3. Character escapes are the least I want to see.
*sigh* That could be because HUnit is calling print (= putStrLn . show) on the String rather than putStrLn; unfortunately, unless it decides to have a special case for Strings, you probably won't be able to get around this. One option, though, is to return, rather than a String, a newtype wrapped around a String whose Show instance returns the String it contains...
> As for the locale:
> $ locale
> LANG=
> LC_COLLATE="C"
> LC_CTYPE="C"
> LC_MESSAGES="C"
> LC_MONETARY="C"
> LC_NUMERIC="C"
> LC_TIME="C"
> LC_ALL=
> Unfortunately, I'm completely lame with locales and googling doesn't
> help very much to change "C" to something like UTF-8 (Mac, Snow
> Leopard). But the terminal does support UTF-8 anyway, as the output
> from my previous message shows (the main function output).
No idea about dealing with Macs, sorry.
-- 
Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com
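
To make the print-versus-putStrLn point concrete, here is a small sketch of what show produces for a plain String and for a wrapper whose Show instance returns its contents. Raw, escaped and verbatim are hypothetical names used only for illustration; the wrapper is the same idea as the UString type earlier in the thread.

```haskell
-- Wrapper whose Show instance returns the contents unescaped.
newtype Raw = Raw String

instance Show Raw where
  show (Raw s) = s

-- show on a plain String adds quotes and escapes every non-ASCII
-- character as a decimal code point: "\1072\1073\1074" (with the quotes).
escaped :: String
escaped = show "абв"

-- show on the wrapper yields the three Cyrillic characters unchanged.
verbatim :: String
verbatim = show (Raw "абв")
```

Passing escaped to putStrLn prints the ASCII-only escape sequence, which is why it never trips over the locale; passing verbatim prints real Cyrillic characters, so it only works when the output handle's encoding can represent them.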

On 8/29/10 07:35, Ivan Lazar Miljenovic wrote:
> On 29 August 2010 21:24, Peter Gromov wrote:
>> $ locale
>> LANG=
>> LC_COLLATE="C"
>> LC_CTYPE="C"
>> LC_MESSAGES="C"
>> LC_MONETARY="C"
>> LC_NUMERIC="C"
>> LC_TIME="C"
>> LC_ALL=
>> Unfortunately, I'm completely lame with locales and googling doesn't
>> help very much to change "C" to something like UTF-8 (Mac, Snow
>> Leopard). But the terminal does support UTF-8 anyway, as the output
>> from my previous message shows (the main function output).
> No idea about dealing with Macs, sorry.

"export LC_ALL=en_US.UTF-8"

Terminal.app normally does this automatically; see Preferences > Settings, Advanced tab for whatever terminal definition you are using. At the bottom of the pane, under "International", make sure "Set locale environment variables on startup" is checked.

Another way to do it is to arrange for the Finder to launch with the locale pre-set. You want something like this:

mress:20011 Z$ plutil -convert xml1 -o - ~/.MacOSX/environment.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>LANG</key>
        <string>en_US.UTF-8</string>
        <key>LC_ALL</key>
        <string>en_US.UTF-8</string>
</dict>
</plist>

You'll have to log out and back in to activate it. (Alternatively to manipulating plists by hand, use something like URI:http://www.apple.com/downloads/macosx/system_disk_utilities/environmentv... ( http://preview.tinyurl.com/3ythrds ).)

-- 
brandon s. allbery     [linux,solaris,freebsd,perl]      allbery@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university      KF8NH
participants (3):
- Brandon S Allbery KF8NH
- Ivan Lazar Miljenovic
- Peter Gromov