Re: [xmonad] spawn functions are not unicode safe

On Thu, Jan 15, 2009 at 12:21 PM, Eric Mertens wrote:
On Thu, 2009-01-15 at 12:04 -0500, Gwern Branwen wrote:
Perhaps we're over-thinking all this. Is it a problem in any way to run encodeString over a String that is just normal ASCII (that is, no funky Unicode)?
Eric: could we just mindlessly call encodeString on everything going into spawn/safeSpawn?
ASCII is valid UTF-8 encoded Unicode, however Latin1 is not. So as long as you stick to ASCII (values less than 128) you can treat them as UTF-8.
ISO 8859-1 and ASCII Extended are not valid UTF-8, however (due to their use of the values 128-255).
Does this answer your question?
If I'm understanding you, the answer is 'you can safely call encodeString on ASCII text, and UTF text, but you cannot on ISO8859-1 & ASCII Extended'. So we can either default to calling encodeString, checking whether it's ISO/Extended (and not calling encodeString if True); or we can default to not calling encodeString, and enabling it if a check for UTF returns true.
I guess since Alexey has already provided a check for UTF, then we should probably use the latter strategy.

-- gwern
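
A "check, then decide whether to call encodeString" strategy could be sketched roughly as below. This is not the check Alexey posted (it does not appear in this part of the thread); looksUtf8Encoded and encodeUnlessEncoded are hypothetical names, and the encoder is passed in as a parameter so that, for example, encodeString from utf8-string could be supplied. The check is a loose structural one: it assumes one byte per Char and does not reject overlong forms or surrogates.

```haskell
import Data.Char (ord)

-- | Hypothetical check: does this String already hold UTF-8 bytes,
-- one byte per Char? Pure ASCII trivially passes.
looksUtf8Encoded :: String -> Bool
looksUtf8Encoded = go . map ord
  where
    go [] = True
    go (b:bs)
      | b < 0x80               = go bs              -- ASCII byte
      | b >= 0xC2 && b <= 0xDF = continuation 1 bs  -- lead byte of a 2-byte sequence
      | b >= 0xE0 && b <= 0xEF = continuation 2 bs  -- lead byte of a 3-byte sequence
      | b >= 0xF0 && b <= 0xF4 = continuation 3 bs  -- lead byte of a 4-byte sequence
      | otherwise              = False              -- stray byte, or a Char above 0xFF

    -- Require exactly n continuation bytes, then carry on with the rest.
    continuation :: Int -> [Int] -> Bool
    continuation n bs =
      let (cs, rest) = splitAt n bs
      in length cs == n && all isCont cs && go rest

    isCont c = c >= 0x80 && c <= 0xBF

-- | Apply the supplied encoder only when the input does not already
-- look encoded, which avoids encoding the same text twice.
encodeUnlessEncoded :: (String -> String) -> String -> String
encodeUnlessEncoded encode s
  | looksUtf8Encoded s = s
  | otherwise          = encode s
```

Such a check is only a heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 sequences would be taken for already-encoded text.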

If I'm understanding you, the answer is 'you can safely call encodeString on ASCII text, and UTF text, but you cannot on ISO8859-1 & ASCII Extended'. So we can either default to calling encodeString, checking whether it's ISO/Extended (and not calling encodeString if True); or we can default to not calling encodeString, and enabling it if a check for UTF returns true.
I guess since Alexey has already provided a check for UTF, then we should probably use the latter strategy.
The only problem is users with one-byte encodings. encodeString can safely be called on a string which contains only ASCII characters. For ASCII input encodeString == id, so nothing will change for ASCII. Users with a UTF-8 locale will get nicely encoded strings. Users with one-byte encodings will get garbage. But they get garbage anyway; it will only look different.

So I think wrapping everything in encodeString is a sound solution. It is not The Right Way To Do Things. It's only a workaround, but still better than nothing. This is a job for the standard libraries, not for software developers.

For more information on the issue, read below.

There is no such thing as a "[some encoding] encoded string" in Haskell (most of the time). Strings in Haskell _are_ Unicode. A Char is a valid Unicode code point: not a byte, not a Word32, but a fairly abstract code point. Strings usually contain only ASCII characters, but they are Unicode nevertheless. A Char is represented somehow, but one shouldn't have to bother about that most of the time.

Problems arise when a Char is passed to the outside world. The world understands sequences of bytes, so strings must be encoded somehow. The standard library uses a very simple method: (\c -> c .&. 0xff). Every character is translated to one byte. It is simple, but it works only for ASCII (and maybe Latin-1). That behavior is why all those encodeString calls are needed.

Some examples to illustrate the above:

ы U+044B Name: CYRILLIC SMALL LETTER YERU

    Prelude> fromEnum $ 'ы'
    1099
    -- (1099 == 0x44b)
    Prelude> putStrLn . encodeString $ [toEnum 0x44b]
    ы
    Prelude> putStrLn $ [toEnum 0x44b]
    K
    Prelude> putStrLn . encodeString $ [toEnum $ 0xff .&. 0x44b]
    K

P.S. It is not safe to call encodeString on a string which is already UTF-8 encoded.

No encoding. Pass the string as it is:

    Prelude> putStrLn "Ну что тут с уникодом?"
    C GB> BCB A C=8:>4>

Encode the string in UTF-8:

    Prelude> putStrLn . encodeString $ "Ну что тут с уникодом?"
    Ну что тут с уникодом?

Encode a string which is already UTF-8 encoded:

    Prelude> putStrLn . encodeString . encodeString $ "Ну что тут с уникодом?"
    ÐÑ ÑÑо ÑÑÑ Ñ Ñникодом?
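
The difference between truncating and encoding can also be reproduced without a terminal. The following is a minimal sketch, assuming a hand-written utf8Encode as a stand-in for encodeString (so that no extra package is needed) and truncateToByte as a model of the (\c -> c .&. 0xff) behavior described above; both names are made up for this illustration.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Stand-in for encodeString: UTF-8 encode a String, producing one
-- Char per output byte.
utf8Encode :: String -> String
utf8Encode = concatMap (map chr . encodeChar . ord)
  where
    encodeChar c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = 0x80 .|. (x .&. 0x3F)

-- | Model of the output path described above: keep only the low byte
-- of every code point.
truncateToByte :: String -> String
truncateToByte = map (chr . (.&. 0xFF) . ord)
```

With these definitions, utf8Encode leaves "xmonad" unchanged, utf8Encode "\x44b" gives the two bytes "\xD1\x8B", truncateToByte "\x44b" gives "K", and applying utf8Encode twice to "\x44b" gives "\xC3\x91\xC2\x8B", which is the kind of double-encoded output shown in the last GHCi line.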

Khudyakov Alexey wrote:
So I think wrapping everything in encodeString is a sound solution. It is not The Right Way To Do Things. It's only a workaround, but still better than nothing. This is a job for the standard libraries, not for software developers.
Well put. But it would be best to keep Xmonad encoding agnostic. Filenames are byte sequences, so spawn should take [Word8] instead of [Char]. Otherwise the best it can do is convert [Char] to [Word8] by truncating each Char. Nothing guarantees that every executable has a UTF-8 name, after all, even if the default system locale is UTF-8.

It's better to investigate where the argument of spawn comes from, and handle the problem there. Xlib knows about character encodings; maybe we could use its facilities and thus avoid adding further dependencies.

-- 
Feri.
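
To make the [Char] to [Word8] conversion concrete, here is a small sketch of two candidate conversions. The function names are hypothetical and neither is part of xmonad: truncatedBytes keeps the low byte of each Char and never fails, while latin1Bytes refuses any code point that does not fit in one byte.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Truncating conversion: fromIntegral wraps modulo 256, so code
-- points above 0xFF silently lose their high bits.
truncatedBytes :: String -> [Word8]
truncatedBytes = map (fromIntegral . ord)

-- | Latin-1 conversion: succeeds only if every code point fits in a byte.
latin1Bytes :: String -> Maybe [Word8]
latin1Bytes = mapM toByte
  where
    toByte c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing
```

For "\x44b" the first function returns [75] (the byte for 'K') and the second returns Nothing, so any function of this type already embodies a choice of encoding.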

Well put. But it would be best to keep Xmonad encoding agnostic. Filenames are byte sequences, so spawn should take [Word8] instead of [Char]. Otherwise the best it can do is convert [Char] to [Word8] by truncating each Char. Nothing guarantees that every executable has a UTF-8 name, after all, even if the default system locale is UTF-8.
It's a good thing to keep xmonad encoding agnostic. Encoding is surely out of scope for a window manager. But making spawn accept [Word8] instead of String does not solve the problem. It just reformulates it and makes it more obvious. Converting [Char] to [Word8] is string encoding ;-).
It's better to investigate where the argument of spawn comes from, and handle the problem there. Xlib knows about character encodings, maybe we could use its facilities and thus avoid adding further dependencies.
Arguments of spawn can come from anywhere: user input, hardcoded strings, etc. These sources are far too numerous for each of them to handle encoding by itself. As for Xlib, AFAIK everything is OK with it. The real problem is the standard library: putStrLn, executeFile, etc.

The only solution which keeps xmonad encoding agnostic and still allows strings to be encoded is to push the choice of encoding to the user. At least I could not imagine anything else. I've come up with two possible solutions.

1. Add an encoding field to XConfig.

    data XConfig l = XConfig {
        -- skipped
        encoding :: String -> String  -- User supplied encoding
    }

   Downside: requires passing the config to every function like spawn => changes the API => breaks an awful lot of code.
2. Something along these lines:

    import Data.IORef (IORef, newIORef, readIORef)
    import System.IO.Unsafe (unsafePerformIO)

    -- IORef which stores the encoding function. Its value should be set
    -- from the user's config and never modified afterwards.
    {-# NOINLINE coding #-}
    coding :: IORef (String -> String)
    coding = unsafePerformIO (newIORef id)

    -- Function to get the encoder. Should be the only way to obtain it.
    getEncoder :: IO (String -> String)
    getEncoder = readIORef coding

    -- putStrLn which works with Unicode strings (sometimes)
    encodedPutStrLn :: String -> IO ()
    encodedPutStrLn str = do
        encode <- getEncoder
        putStrLn . encode $ str

   Upside: breaks nothing. Only a few functions change, plus a way to set the encoding.
   Downside: it is a kludge.
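
For comparison, the two options can be put side by side in one small module. This is only a sketch under stated assumptions: Config is a cut-down stand-in for XConfig, setEncoder and encodeForOutput are hypothetical helpers that are not in the message above, and the global reference is created once with unsafePerformIO and a NOINLINE pragma, which is a common idiom for top-level mutable state rather than anything xmonad is known to provide.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import System.IO.Unsafe (unsafePerformIO)

-- Option 1: the encoder lives in the config, so every function that
-- needs it must be handed the config explicitly.
data Config = Config
  { encoding :: String -> String  -- user-supplied encoder
  }

encodeWithConfig :: Config -> String -> String
encodeWithConfig = encoding

-- Option 2: one global IORef, created once. NOINLINE keeps GHC from
-- inlining the definition and creating more than one reference.
{-# NOINLINE coding #-}
coding :: IORef (String -> String)
coding = unsafePerformIO (newIORef id)

-- | Hypothetical setter, meant to be called once from the user's config.
setEncoder :: (String -> String) -> IO ()
setEncoder = writeIORef coding

-- | The only intended way to obtain the current encoder.
getEncoder :: IO (String -> String)
getEncoder = readIORef coding

-- | Encode a string with whatever encoder is currently installed.
encodeForOutput :: String -> IO String
encodeForOutput str = do
  encode <- getEncoder
  return (encode str)
```

With option 1 the signature of every spawn-like function grows a Config argument; with option 2 existing signatures stay as they are, and a user config would call something like setEncoder once at startup.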
Comments, suggestions, ideas are welcome.

P.S. I found this in the documentation for XMonad.Util.XSelection, so spawn is not the only thing affected:

  * Unicode handling is busted. But it's still better than calling 'chr' to translate to ASCII, at least. As near as I can tell, the mangling happens when the String is outputted somewhere, such as via promptSelection's passing through the shell, or GHCi printing to the terminal. utf8-string has IO functions which can fix this, though I do not know how to use them here. It's a complex issue; see
    http://www.haskell.org/pipermail/xmonad/2007-September/001967.html
    http://www.haskell.org/pipermail/xmonad/2007-September/001966.html
participants (3)
- Ferenc Wagner
- Gwern Branwen
- Khudyakov Alexey