
Andrea Rossato
I hope Mats Jansborg is still reading the mailing list: hopefully he could give us some direction.
I am, sorry for the late reply. I'm not sure I'm comfortable being appointed some kind of X11 Unicode authority, though; I'm completely new to low-level X11 programming, and what knowledge I have can be had more quickly and reliably by reading the X11 manual, the ICCCM and Keith Packard's paper at http://keithp.com/~keithp/talks/selection.ps. Nevertheless, since you ask, I'll try to answer as well as I can.
On Sat, Aug 25, 2007 at 05:43:53AM -0400, Gwern Branwen wrote:
However, while it works fine for your basic bread and butter ASCII characters, I noticed that it does terrible things to more exotic phrases involving Unicode, such as "Henri Poincaré". I borrowed some code from utf-string, and that improved it a little bit - "Henri Poincaré" now becomes "Henri Poincarý" which is still better than "Henri Poincar\245" or whatever.
Well, it is not better: Poincar\245 is right, the other is wrong, since part of the character has been truncated.
The problem is not Haskell, it's me (and you..;-): actually
I'm afraid it's a little of both :) Henri Poincaré is not quite "exotic" enough for X to have any problems with it; for this simple case it is actually all Haskell's fault. Henri Poincaré is completely representable in ISO-8859-1, which is what you get when you ask for the selection encoded as STRING. The problem is that most Haskell functions that deal with String and interface with the operating system are broken: they behave as though they had types involving [Word8] instead of [Char], and only the least significant byte of each Char is used. As a workaround, the programmer must manually convert the Haskell String to the locale encoding, represented as a Haskell [Char] using only the low byte of each Char. To do this properly you need iconv or something similar, such as http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding-0.2. Supporting only UTF-8 locales is perhaps a little better than supporting only ASCII or ISO-8859-1, but still fundamentally wrong in my opinion.
The other problem is that you ask for the selection in the STRING encoding. This limits its usefulness immensely, as STRING is unlikely to support many of the characters in the user's locale. X11 and the ICCCM define a method of negotiating the format of the selection. Both UTF8_STRING (which is not in the ICCCM but is supported by most modern applications) and COMPOUND_TEXT should imho be preferred to STRING, which can be used as a fallback encoding if neither of the first two is available. I've previously posted an example of how to read compound text properties in my patch to the NamedWindow module, although that method is not portable either, since it uses withCWString from Foreign.C.String, which works correctly only if __STDC_ISO_10646__ is defined (i.e. if wchar_t is UTF-32).
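The preference order described above (UTF8_STRING, then COMPOUND_TEXT, then STRING as a fallback) can be sketched as a small pure function. This is only an illustration: preferredTargets and chooseTarget are hypothetical names, and the list of offered targets is assumed to have been obtained already, for example through the ICCCM TARGETS request, which is not shown here.

```haskell
import Data.List (find)

-- Selection targets, most preferred first, following the order
-- suggested in the text above.
preferredTargets :: [String]
preferredTargets = ["UTF8_STRING", "COMPOUND_TEXT", "STRING"]

-- | Pick the most preferred target that the selection owner offers.
-- 'offered' stands in for the atom names the owner reports; talking to
-- the X server to obtain them is deliberately left out of this sketch.
chooseTarget :: [String] -> Maybe String
chooseTarget offered = find (`elem` offered) preferredTargets
```

A requestor would then ask for the selection using the chosen target and give up (or report an error) when the result is Nothing.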
If I test hxsel with the first 3 Cyrillic characters of this page: http://gorgias.mine.nu/unicode.php I get: \u041f\u0440\u0438 which is the correct answer. The problem is: how can I convert these Unicode characters into something that can be printed?
Yes, since it is impossible to represent Cyrillic characters in ISO-8859-1, Firefox (or whatever browser you use to provide the selection) makes its best effort and renders it as the string ['\\', '0', '4', '1', ... ]. The proper solution is not to try to interpret this string, as an application is free to do whatever it wants with unrepresentable characters (including omitting them or replacing them with e.g. '?'). The solution is to ask for the selection in an encoding where these characters are representable, convert it to a Haskell String, convert that String to the locale C encoding and pass it to the operating system. In simple applications you can sometimes skip the middle step: ask for the selection in the locale encoding and pass that directly back to the OS. This is very convenient when it works, but it breaks down as soon as you need to interpret the data as a string of characters, for instance because you want to prepend "/bin/sh -c" to it.
In addition, and unrelated to Unicode: from my quick read through the ICCCM, it appears that the requestor (you) is responsible for deleting the target property, and that currentTime should not be used in the xConvertSelection request. Note also that the window is not destroyed in the case where the selection is properly converted, and that you do not check that the conversion has succeeded before starting the transfer of the property.
/Mats
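To illustrate the three steps described above (selection bytes, then Haskell String, then locale-encoded low-byte String), here is a minimal sketch that assumes the selection was delivered as UTF8_STRING and that the locale is UTF-8. That last assumption is exactly the limitation criticised earlier; a proper version would go through iconv or a similar library. All function names are hypothetical, and the decoder does not reject overlong forms or surrogates.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Encode one Char as UTF-8 bytes.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Decode UTF-8 bytes into a String; malformed input becomes U+FFFD.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b:bs)
  | b < 0x80           = chr (fromIntegral b) : decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)
  | otherwise          = '\xFFFD' : decodeUtf8 bs
  where
    multi k lead =
      let (cs, rest) = splitAt k bs
          ok = length cs == k && all (\x -> x .&. 0xC0 == 0x80) cs
          n  = foldl (\acc x -> (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F))
                     (fromIntegral lead) cs :: Int
      in if ok && n <= 0x10FFFF
           then chr n : decodeUtf8 rest
           else '\xFFFD' : decodeUtf8 bs

-- | Represent bytes as a [Char] that uses only the low byte of each Char,
-- which is the form the byte-oriented String functions expect.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)

-- | Haskell String to "locale" String, ASSUMING the locale is UTF-8.
toLocaleString :: String -> String
toLocaleString = bytesAsChars . concatMap encodeChar
```

Because decodeUtf8 yields a real String of characters, text such as "/bin/sh -c " can be prepended safely before toLocaleString is applied to the whole command.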