Re: [Haskell-cafe] Unicode pretty-printing

29 Aug 2010

Thanks for getting back to me. I was imprecise: by "UTF8 characters" I
meant Unicode characters. My source files are UTF-8 encoded, and Haskell
reads them fine; the only problem is printing them in a readable way. For
now I'm only talking about plain console output, not any other I/O.

Avoiding Show is not much of an option, since I'm using HUnit, which
uses Show and prints the test results via the standard output
functions. I've tried wrapping my strings in a newtype whose Show
instance doesn't escape anything, but the standard output functions
don't accept the unescaped text (see the exception below), and HUnit
doesn't know anything about System.IO.UTF8:

---------
import System.IO.UTF8
import qualified System.IO
import Test.HUnit

newtype UString = UString String

instance Show UString where
  show (UString s) = s
instance Eq UString where
  (==) (UString s1) (UString s2) = s1 == s2

test1 = TestCase (assertEqual "fail" (UString "абв") (UString "где"))

main =
	System.IO.hSetBinaryMode System.IO.stdout True >>
	System.IO.UTF8.putStrLn "это тест"
---------
Prelude> :load utest.hs
[1 of 1] Compiling Main             ( utest.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
это тест
*Main> runTestTT test1
### Failure:
fail
expected: *** Exception: <stderr>: hPutChar: invalid argument (Illegal
byte sequence)
---------
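
A possible workaround, offered as an untested sketch rather than a confirmed fix: the exception names <stderr>, which is the handle runTestTT writes its report to, and "Illegal byte sequence" suggests that the encoding currently set on that handle cannot represent Cyrillic. Assuming GHC 6.12 or later, hSetEncoding can switch the handle to UTF-8. The names useUtf8Console and demo are made up for illustration, and demo only imitates the kind of text HUnit would print.

```haskell
import System.IO

-- Switch both console handles to UTF-8 so that writing non-ASCII
-- Chars no longer depends on the encoding of the current locale.
useUtf8Console :: IO ()
useUtf8Console = do
  hSetEncoding stdout utf8
  hSetEncoding stderr utf8

-- Hypothetical stand-in for a test report: writes unescaped
-- Cyrillic text to stderr after the encoding has been set.
demo :: IO ()
demo = do
  useUtf8Console
  hPutStrLn stderr "expected: абв"
  hPutStrLn stderr " but got: где"
```

In GHCi, useUtf8Console would be run once before runTestTT test1. Whether the letters then show up correctly still depends on the terminal itself expecting UTF-8.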

I've also tried replacing UString X in the test with Data.Text.pack X
and even, in desperation, with Data.Text.Encoding.encodeUtf8
(Data.Text.pack X), but no luck: this time, instead of a crash, I get
the good old escapes again.
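
The escapes come from show itself rather than from the output functions, which a small sketch can make visible. As far as I know, the Show instance for Data.Text goes through the String one, so packing the value would not change the result. The names escaped and printBoth are made up for illustration.

```haskell
-- show renders a String as a Haskell literal, so every non-ASCII
-- Char is turned into a decimal escape such as \1072.
escaped :: String
escaped = show "абв"

-- Writing the String itself passes the characters to the handle's
-- encoding without any escaping.
printBoth :: IO ()
printBoth = do
  putStrLn escaped  -- the literal form, quotes and backslashes included
  putStrLn "абв"    -- the letters themselves, if the handle encoding allows
```

Anything that calls show on the value, as HUnit does for its failure report, gets the first form unless the Show instance is replaced.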

On 29 August 2010 00:09, Yitzchak Gale wrote:
> Peter Gromov wrote:
>> Unfortunately, Haskell escapes UTF8 characters.
>
> What do you mean by "UTF8 characters"?
> Each element of the Char type represents a single Unicode
> character, not encoded in UTF-8 or any other encoding.
>
> When you read a text file using the traditional IO functions,
> recent versions of GHC will use the encoding of the
> "current locale" (whatever that means on your system)
> to decode the input into Unicode, unless you specify
> otherwise. The same is true for writing to the console or
> to a file.
>
> As Don pointed out, you may be interested in
> using the newer Data.Text instead, especially when
> encodings matter to you. It will usually be faster than
> traditional IO, and it is designed to be the new standard
> for representing text in Haskell.
>
> A third option would be to read the data as raw binary
> bytes, without any decoding, using Data.ByteString.
> Then it is totally up to you to do any decoding or
> encoding.
>
> In any case, the standard Show instances will not be
> able to do a very good job of displaying non-ASCII
> characters; Show cannot make very many assumptions
> about your data or your environment. As Don suggested,
> you may want to define your own type class similar to
> Show that does what you want.
>
> Regards,
> Yitz
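
For reference, the Show-like class suggested in the quoted message could look roughly like the following. This is a minimal sketch: the class name Display and its method display are made up for illustration and do not come from any library, and UString is the newtype from the test file above.

```haskell
import Data.List (intercalate)

-- Hypothetical Show-like class: render a value for human readers,
-- leaving non-ASCII characters unescaped.
class Display a where
  display :: a -> String

newtype UString = UString String

-- The wrapped text is returned as-is, with no quotes or escapes.
instance Display UString where
  display (UString s) = s

-- Types without text content can fall back on show.
instance Display Int where
  display = show

-- Lists are rendered element by element, so nested text stays readable.
instance Display a => Display [a] where
  display xs = "[" ++ intercalate ", " (map display xs) ++ "]"
```

This only helps for output printed by one's own code, for example putStrLn (display x). HUnit's assertEqual still requires a Show instance, so it does not by itself change what the test report looks like.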