XSelection.hs - Unicode problems

So I was working on splitting out Andrea's hxsel into something useful for one's Config.hs, and I'm basically done, and it seems to work well. However, while it works fine for your basic bread and butter ASCII characters, I noticed that it does terrible things to more exotic phrases involving Unicode, such as "Henri Poincaré". I borrowed some code from utf-string, and that improved it a little bit - "Henri Poincaré" now becomes "Henri Poincarý" which is still better than "Henri Poincar\245" or whatever. As far as I can tell, whenever a String involving UTF-8 stuff leaves the Haskell environment, it gets messed up. This is a little hard to test since things get messed up even when I test them in GHCi. :) But I'm sure it has to be something to do with Haskell, since I know I can copy and paste such strings with no problems using the mouse, and I know that my shell isn't the problem and nor are the Surfraw programs I use to pass them to Firefox (and Firefox obviously has no problems handling those characters). utf-string does have a special IO module which can print out and receive UTF-8 strings, but it's limited to things like 'putStr' and 'getContents'. There doesn't seem to be anything that would be useful for fixing 'spawn', which seems to be the specific function that needs fixing, since it isn't using any of them and executeFile doesn't seem to use any of them either. spawn x = io $ do pid <- forkProcess $ do forkProcess (createSession >> executeFile "/bin/sh" False ["-c", x] Nothing) exitWith ExitSuccess getProcessStatus True False pid return () Any ideas on how to fix this? ----------------------------------------------------------------------------- -- | -- Module : XMonadContrib.XSelection -- Copyright : (C) 2007 Andrea Rossato -- License : BSD3 -- -- Maintainer : andrea.rossato@unibz.it -- Stability : unstable -- Portability : unportable -- -- A module for accessing and manipulating the X Window mouse selection (used in copy and pasting). -- ----------------------------------------------------------------------------- {- $usage Add 'import XMonadContrib.XSelection' to the top of Config.hs Then make use of getSelection or promptSelection as needed. TODO: * Fix Unicode handling. Currently it's still better than calling 'chr' to translate to ASCII, though. As near as I can tell, the mangling happens when the String is outputted somewhere, such as via promptSelection's passing through the shell, or GHCi printing to the terminal. utf-string has IO functions which can fix this, though I do not know have to use them here. * Add a 'putSelection' to allow modification of the selection. * Possibly add some more elaborate functionality: Emacs' registers are nice. -} module XMonadContrib.XSelection (getSelection, promptSelection) where -- getSelection's imports: import Graphics.X11.Xlib (allocaXEvent, createSimpleWindow, defaultScreen, destroyWindow, internAtom, nextEvent, openDisplay, rootWindow, selectionNotify, sTRING) import Graphics.X11.Xlib.Extras (currentTime, ev_event_type, getEvent, getWindowProperty8, xConvertSelection) import Data.Maybe (fromMaybe) import Data.Char (chr) -- promptSelection's imports: import XMonad (io, spawn, X ()) -- decode's imports import Foreign -- (Word8(), (.&.), shiftL, (.|.)) -- | Returns a String corresponding to the current mouse selection in X; if there is none, an empty string is returned.. Note that this is -- only reliable for ASCII text and currently mangles/escapes more complex UTF-8 characters. getSelection :: IO String getSelection = do dpy <- openDisplay "" let dflt = defaultScreen dpy rootw <- rootWindow dpy dflt win <- createSimpleWindow dpy rootw 0 0 200 100 0 0 0 p <- internAtom dpy "PRIMARY" True clp <- internAtom dpy "BLITZ_SEL_STRING" False xConvertSelection dpy p sTRING clp win currentTime allocaXEvent $ \e -> do nextEvent dpy e ev <- getEvent e if ev_event_type ev == selectionNotify then do res <- getWindowProperty8 dpy clp win return $ decode . fromMaybe [] $ res else destroyWindow dpy win >> return "" -- | A wrapper around getSelection. Makes it convenient to run a program with the current selection as an argument. This is convenient -- for handling URLs, in particular. For example, in your Config.hs you could bind a key to 'promptSelection "firefox"'; this would allow you to -- highlight a URL string and then immediately open it up in Firefox. promptSelection :: String -> X () promptSelection app = spawn . ((app ++ " ") ++) =<< io getSelection {- UTF-8 decoding for internal use in getSelection. This code is totally stolen from Eric Mertens's utf-string library http://code.haskell.org/utf8-string/ (version 0.1), which fortunately is BSD-3 licensed, so I can just copy it into this BSD-3 licensed module. I guess it'd be better to just import Codec.Binary.UTF8.String (decode), but then users of this would need to install it, and Xmonad has enough dependencies as it is. -} decode :: [Word8] -> String decode [ ] = "" decode (c:cs) | c < 0x80 = chr (fromEnum c) : decode cs | c < 0xc0 = replacement_character : decode cs | c < 0xe0 = multi_byte 1 0x1f 0x80 | c < 0xf0 = multi_byte 2 0xf 0x800 | c < 0xf8 = multi_byte 3 0x7 0x10000 | c < 0xfc = multi_byte 4 0x3 0x200000 | c < 0xfe = multi_byte 5 0x1 0x4000000 | otherwise = replacement_character : decode cs where replacement_character :: Char replacement_character = '\xfffd' multi_byte :: Int -> Word8 -> Int -> [Char] multi_byte i mask overlong = aux i cs (fromEnum (c .&. mask)) where aux 0 rs acc | overlong <= acc && acc <= 0x10ffff && (acc < 0xd800 || 0xdfff < acc) && (acc < 0xfffe || 0xffff < acc) = chr acc : decode rs | otherwise = replacement_character : decode rs aux n (r:rs) acc | r .&. 0xc0 == 0x80 = aux (n-1) rs $ shiftL acc 6 .|. fromEnum (r .&. 0x3f) aux _ rs _ = replacement_character : decode rs -- gwern DDR&E E911 BCCI State EO PRIME iButton WID OAU Flintlock

On Sat, Aug 25, 2007 at 05:43:53AM -0400, Gwern Branwen wrote:
However, while it works fine for your basic bread and butter ASCII characters, I noticed that it does terrible things to more exotic phrases involving Unicode, such as "Henri Poincaré". I borrowed some code from utf-string, and that improved it a little bit - "Henri Poincar�" now becomes "Henri Poincarý" which is still better than "Henri Poincar\245" or whatever.
Well, it is not better: Poincar\245 is right, the other is wrong, since part of the character has been truncated.
As far as I can tell, whenever a String involving UTF-8 stuff leaves the Haskell environment, it gets messed up. This is a little hard to test since things get messed up even when I test them in GHCi. :) But I'm sure it has to be something to do with Haskell, since I know I can copy and paste such strings with no problems using the mouse, and I know that my shell isn't the problem and nor are the Surfraw programs I use to pass them to Firefox (and Firefox obviously has no problems handling those characters).
The problem is not Haskell, it's me (and you..;-): actually , while I have a clear undertsnding of Unicode and utf-8 in general, I don't know how to handle multi-byte characters in any language that is not Php...;-) I hope Mats Jansborg is still reading the mailing list: hopefully he could give us some direction.
Any ideas on how to fix this?
----------------------------------------------------------------------------- -- | -- Module : XMonadContrib.XSelection -- Copyright : (C) 2007 Andrea Rossato
thanks for the gift, but copyright should belong to you, since you wrote it...;-) If I test hxsel with the first 3 Cyrillic characters of this page: http://gorgias.mine.nu/unicode.php I get: \u041f\u0440\u0438 which is the correct answer. The problem is: how can I convert this unicode characters into something that can be printed? Andrea

Andrea Rossato
I hope Mats Jansborg is still reading the mailing list: hopefully he could give us some direction.
I am, sorry for the late reply. Although I'm not sure I'm comfortable being appointed some kind of x11 unicode authority; I'm completely new to low-level X11 programming and what knowledge I have can be had more quickly and reliably by reading the X11 manual, the ICCCM and Keith Packard's paper at http://keithp.com/~keithp/talks/selection.ps. Never the less, since you ask I'll try to answer as well as I can.
On Sat, Aug 25, 2007 at 05:43:53AM -0400, Gwern Branwen wrote:
However, while it works fine for your basic bread and butter ASCII characters, I noticed that it does terrible things to more exotic phrases involving Unicode, such as "Henri Poincaré". I borrowed some code from utf-string, and that improved it a little bit - "Henri Poincar�" now becomes "Henri Poincarý" which is still better than "Henri Poincar\245" or whatever.
Well, it is not better: Poincar\245 is right, the other is wrong, since part of the character has been truncated.
The problem is not Haskell, it's me (and you..;-): actually
I'm afraid it's a little of both :) Henri Poincaré is not quite "exotic" enough for X to have any problems with it, for this simple case it is actually all Haskell's fault. Henri Poincaré is completely representable in ISO-8859-1 which is what you get when you ask for the selection encoded as STRING. The problem is that most Haskell functions that deal with String and interface with the operating system are broken, they behave as though they had types involving [Word8] instead of [Char] where only the least significant byte is used. As a workaround the programmer must manually convert the Haskell String to the locale encoding, represented as a Haskell [Char] using only the low byte of each Char. To do this properly you need iconv or something similar such as http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding-0.2. To support only utf-8 locales is perhaps a little better than supporting only ASCII or ISO-8859-1, but still fundamentally wrong in my opinion. The other problem is that you ask for the selection in the STRING encoding. This limits its usefulness immensely, as STRING is unlikely to support many of the characters in the user's locale. X11 and ICCCM defines a method of negotiating the format of the selection. Both UTF8_STRING (which is not in ICCCM but supported by most modern applications) and COMPOUND_TEXT should imho be preferred to STRING which can be used as fallback encoding if neither of the two first two are available. I've previously posted an example of how to read compound text properties in my patch to the NamedWindow module, although that method is not portable either since it uses withCWString from Foreign.C.String which works correctly only if __STDC_ISO_10646__ is defined (i.e. if wchar_t is UTF-32).
If I test hxsel with the first 3 Cyrillic characters of this page: http://gorgias.mine.nu/unicode.php I get: \u041f\u0440\u0438 which is the correct answer. The problem is: how can I convert this unicode characters into something that can be printed?
Yes, since it is impossible to represent Cyrillic characters in ISO-8859-1, Firefox (or whatever browser you use to provide the selection) makes its best effort and renders it as the string ['\\', '0', '4', '1', ... ]. The proper solution is not to try to interpret this string as an application is free to do whatever it wants with unrepresentable characters (including omitting them or replacing them with e.g. '?'). The solution is to ask for the selection in an encoding where these characters are representable, convert it to Haskell String, convert that String to the locale C encoding and pass it to the operating system. In simple applications you can sometimes skip the middle step and ask for the selection in the locale encoding and pass that directly back to the os. This is very convenient when it works but it breaks down as soon as you need to interpret the data as a string of characters, for instance because you want to prepend "/bin/sh -c" to it. In addition, not having anything to do with unicode, from my quick read through the ICCCM it appears the requestor (you) is responsible for deleting the target property, and that currentTime should not be used in the xConvertSelection request. Note also that the window is not destroyed in the case where the selection is properly converted and that you do not check that the conversion has succeeded before starting the transfer of the property. /Mats

On Sun, Sep 02, 2007 at 01:00:53AM +0200, Mats Jansborg wrote:
Andrea Rossato
writes: I hope Mats Jansborg is still reading the mailing list: hopefully he could give us some direction.
I am, sorry for the late reply. Although I'm not sure I'm comfortable being appointed some kind of x11 unicode authority; I'm completely new to low-level X11 programming and what knowledge I have can be had more quickly and reliably by reading the X11 manual, the ICCCM and Keith Packard's paper at http://keithp.com/~keithp/talks/selection.ps.
You may not be an authority but you gave me a great help, with many useful references. Thanks you!
The other problem is that you ask for the selection in the STRING encoding. This limits its usefulness immensely, as STRING is unlikely to support many of the characters in the user's locale. X11 and ICCCM defines a method of negotiating the format of the selection. Both UTF8_STRING (which is not in ICCCM but supported by most modern applications) and COMPOUND_TEXT should imho be preferred to STRING which can be used as fallback encoding if neither of the two first two are available. I've previously posted an example of how to read compound text properties in my patch to the NamedWindow module, although that method is not portable either since it uses withCWString from Foreign.C.String which works correctly only if __STDC_ISO_10646__ is defined (i.e. if wchar_t is UTF-
I don't get all the issues involved in converting the selection yet. As a quick try I started using COMPUND_TEXT, which seems to be working well with the ISO-8859-1 character set (Poincar� now works!). For the rest, I think I'll have something to study in the next days...;-) Thank you. Andrea ps: Gwern, you can update hxsel.
participants (3)
-
Andrea Rossato
-
Gwern Branwen
-
Mats Jansborg