Re: [Xmonad] XSelection.hs - Unicode problems

1 Sep 2007

Andrea Rossato writes:
...
I hope Mats Jansborg is still reading the mailing list: hopefully he
could give us some direction.
I am; sorry for the late reply. Although I'm not sure I'm comfortable
being appointed some kind of X11 Unicode authority; I'm completely new
to low-level X11 programming and what knowledge I have can be had more
quickly and reliably by reading the X11 manual, the ICCCM and Keith
Packard's paper at http://keithp.com/~keithp/talks/selection.ps.

Nevertheless, since you ask, I'll try to answer as well as I can.
...
On Sat, Aug 25, 2007 at 05:43:53AM -0400, Gwern Branwen wrote:
...
However, while it works fine for your basic bread and butter ASCII
characters, I noticed that it does terrible things to more exotic
phrases involving Unicode, such as "Henri Poincaré". I borrowed
some code from utf-string, and that improved it a little bit -
"Henri Poincaré" now becomes "Henri Poincarý" which is still
better than "Henri Poincar\245" or whatever.
Well, it is not better: Poincar\245 is right, the other is wrong,
since part of the character has been truncated.
...
The problem is not Haskell, it's me (and you..;-): actually
I'm afraid it's a little of both :)

Henri Poincaré is not quite "exotic" enough for X to have any problems
with it; for this simple case it is actually all Haskell's fault. Henri
Poincaré is completely representable in ISO-8859-1, which is what you get
when you ask for the selection encoded as STRING.

The problem is that most Haskell functions that deal with String and
interface with the operating system are broken: they behave as though
they had types involving [Word8] instead of [Char], and only the least
significant byte of each Char is used. As a workaround, the programmer
must manually convert the Haskell String to the locale encoding,
represented as a Haskell [Char] using only the low byte of each Char. To
do this properly you need iconv or something similar, such as
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding-0.2.
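
To make the workaround concrete, here is a minimal sketch of the manual
conversion using only base. It hardcodes UTF-8 as the target encoding,
which is an assumption made purely for illustration; a real program
would look up the locale's encoding and go through iconv or the
encoding package. The names encodeCharUtf8 and encodeUtf8 are
hypothetical, not from any library.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Encode one code point as UTF-8. Each output Char carries a single
-- byte in its low 8 bits, which is the form the byte-oriented String
-- functions effectively expect.
-- Assumption: the input is a Unicode scalar value (no lone surrogates).
encodeCharUtf8 :: Char -> String
encodeCharUtf8 c
  | n < 0x80    = [chr n]
  | n < 0x800   = map chr [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = map chr [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = map chr [ 0xF0 .|. (n `shiftR` 18)
                          , cont (n `shiftR` 12)
                          , cont (n `shiftR` 6)
                          , cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont x = 0x80 .|. (x .&. 0x3F)

-- | Encode a whole String; every Char of the result is below 256.
encodeUtf8 :: String -> String
encodeUtf8 = concatMap encodeCharUtf8
```

With this, encodeUtf8 "Poincar\233" yields "Poincar\195\169", i.e. the
two bytes 0xC3 0xA9 for the final character, each stored in its own
Char.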

To support only utf-8 locales is perhaps a little better than supporting
only ASCII or ISO-8859-1, but still fundamentally wrong in my opinion.

The other problem is that you ask for the selection in the STRING
encoding. This limits its usefulness immensely, as STRING is unlikely to
support many of the characters in the user's locale. X11 and the ICCCM
define a method of negotiating the format of the selection. Both
UTF8_STRING (which is not in the ICCCM but is supported by most modern
applications) and COMPOUND_TEXT should imho be preferred to STRING, which
can be used as a fallback encoding if neither of the first two is
available. I've previously posted an example of how to read compound text
properties in my patch to the NamedWindow module, although that method
is not portable either since it uses withCWString from Foreign.C.String
which works correctly only if __STDC_ISO_10646__ is defined (i.e. if
wchar_t is UTF-32).
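
The preference order described above could be sketched as a small pure
function. This assumes the requestor has already obtained the list of
targets the selection owner offers (for instance via the ICCCM TARGETS
target) and works on target names as plain Strings rather than interned
Atoms, purely to keep the example self-contained. preferredTargets and
chooseTarget are hypothetical names.

```haskell
import Data.List (find)

-- | Text targets in order of preference, most capable first.
preferredTargets :: [String]
preferredTargets = ["UTF8_STRING", "COMPOUND_TEXT", "STRING"]

-- | Pick the best text target the owner offers, or Nothing if the
-- owner offers no text target at all.
chooseTarget :: [String] -> Maybe String
chooseTarget offered = find (`elem` offered) preferredTargets
```

For an owner that offers only STRING and TIMESTAMP, chooseTarget falls
back to Just "STRING"; if UTF8_STRING is offered, it wins regardless of
where it appears in the owner's list.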
...
If I test hxsel with the first 3 Cyrillic characters of this page:
http://gorgias.mine.nu/unicode.php
I get:
\u041f\u0440\u0438
which is the correct answer. The problem is: how can I convert these
Unicode characters into something that can be printed?
Yes, since it is impossible to represent Cyrillic characters in
ISO-8859-1, Firefox (or whatever browser you use to provide the
selection) makes its best effort and renders it as the string ['\\',
'u', '0', '4', '1', ... ]. The proper solution is not to try to interpret
this string, since an application is free to do whatever it wants with
unrepresentable characters (including omitting them or replacing them
with e.g. '?'). The solution is to ask for the selection in an encoding
where these characters are representable, convert it to Haskell String,
convert that String to the locale C encoding and pass it to the
operating system. In simple applications you can sometimes skip the
middle step and ask for the selection in the locale encoding and pass
that directly back to the OS. This is very convenient when it works, but
it breaks down as soon as you need to interpret the data as a string of
characters, for instance because you want to prepend "/bin/sh -c" to it.
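
As an illustration of the middle step (property bytes to Haskell
String), here is a hedged sketch of a UTF-8 decoder using only base. It
assumes the selection was requested as UTF8_STRING and that the property
contents are already available as a list of bytes. Replacing malformed
input with U+FFFD is a choice made for this example, not something the
ICCCM prescribes, and decodeUtf8 is a hypothetical name.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode UTF-8 bytes into a String of real code points.
-- Malformed, overlong, or truncated sequences become U+FFFD.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80              = chr (fromIntegral b) : decodeUtf8 bs
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07)
  | otherwise             = '\xFFFD' : decodeUtf8 bs
  where
    -- Consume n continuation bytes after a lead byte whose payload
    -- bits are given in 'hi'.
    multi :: Int -> Int -> String
    multi n hi =
      let (cs, rest) = splitAt n bs
      in if length cs == n && all isCont cs
           then
             let cp = foldl step hi cs
             in (if valid n cp then chr cp else '\xFFFD') : decodeUtf8 rest
           else '\xFFFD' : decodeUtf8 bs
    step acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)
    isCont c = c .&. 0xC0 == 0x80
    -- Reject overlong forms, surrogates, and values past U+10FFFF.
    valid :: Int -> Int -> Bool
    valid n cp =
      cp >= minFor n && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)
    minFor :: Int -> Int
    minFor 1 = 0x80
    minFor 2 = 0x800
    minFor _ = 0x10000
```

Once the data is a proper String, prepending something like
"/bin/sh -c " is ordinary list concatenation on characters; the result
still has to be converted to the locale encoding before it is handed to
the operating system.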

In addition, and unrelated to Unicode: from my quick read through the
ICCCM it appears that the requestor (you) is responsible for deleting
the target property, and that currentTime should not be used in the
xConvertSelection request. Note also that the window is not destroyed
in the case where the selection is properly converted, and that you do
not check that the conversion has succeeded before starting the
transfer of the property.

/Mats