spawn functions are not unicode safe

Hello.

There is a problem with the `spawn' function, as well as with the *spawn family from XMonad.Util.Run: they mangle Unicode symbols that are passed to them. This is because they use `executeFile', which silently truncates each character to one byte.

The simplest workaround is to use the utf8-string package. It works only on systems with UTF-8 locales, but I hope those are the majority by now.

import Codec.Binary.UTF8.String

-- | Unicode safe spawn
spawnU :: MonadIO m => String -> m ()
spawnU = spawn . encodeString

The same is possible for all the *spawn functions. I think it's worth including Unicode-aware versions in XMonadContrib, but I'm not sure whether anyone needs such functionality.

-- Alexey Khudyakov
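
To make the truncation concrete, here is a minimal sketch using only base. It assumes, as the message above describes, that each Char is reduced to its low 8 bits on the way to the system call. The names truncateChar, encodeCharUtf8 and encodeStringSketch are made up for this illustration; encodeStringSketch only mimics the role encodeString plays in spawnU and is not the utf8-string implementation.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Assumed behaviour of the unpatched spawn path: only the low
-- 8 bits of each code point survive.
truncateChar :: Char -> Word8
truncateChar = fromIntegral . ord

-- | Hand-written UTF-8 encoder for a single code point (hypothetical
-- helper, standing in for the encoder in utf8-string).
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. lead 6, cont 0]
  | n < 0x10000 = [0xE0 .|. lead 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- High bits that go into the lead byte.
    lead s = fromIntegral (n `shiftR` s)
    -- One continuation byte: 10xxxxxx carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Pre-encode a String so that every Char holds a single byte value;
-- truncating the result afterwards loses nothing.
encodeStringSketch :: String -> String
encodeStringSketch = map (chr . fromIntegral) . concatMap encodeCharUtf8
```

For the Cyrillic letter U+044F, truncateChar gives the single byte 0x4F (the letter O), while encodeCharUtf8 gives the two bytes 0xD1 0x8F that a UTF-8 locale expects.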

On Wed, Jan 14, 2009 at 5:46 PM, Khudyakov Alexey wrote:
> The same is possible for all the *spawn functions. I think it's worth including Unicode-aware versions in XMonadContrib, but I'm not sure whether anyone needs such functionality.

I think it's worth including such functionality (especially as we already depend on utf8-string in XMC). I often run Wikipedia & Google searches on terms which include accents and other such UTFy things, and it's a little tiresome fixing the search. Likely they are coming pre-mangled by X, but if the spawn functions are also guilty then this would at least be a step further.

That said, it's also worth asking where this should be done. Do we leave XMonad core alone, and provide spawnU in XMC (and rewrite the ~53 XMC calls to call spawnU)?

-- gwern

On Wed, Jan 14, 2009 at 06:12:07PM -0500, Gwern Branwen wrote:
> That said, it's also worth asking where this should be done. Do we leave XMonad core alone, and provide spawnU in XMC (and rewrite the ~53 XMC calls to call spawnU)?

What's the benefit of maintaining the non-Unicode spawn behavior? Is it needlessly complex to have a spawnA and a spawnU, with the actual spawn function choosing the more appropriate one based on the string itself?
_______________________________________________
xmonad mailing list
xmonad@haskell.org
http://www.haskell.org/mailman/listinfo/xmonad

On Wed, Jan 14, 2009 at 6:58 PM, Sean Escriva wrote:
> What's the benefit of maintaining the non-Unicode spawn behavior? Is it needlessly complex to have a spawnA and a spawnU, with the actual spawn function choosing the more appropriate one based on the string itself?

How would it determine that? I don't know that Data.Char.isLatin1 would suffice.

(Incidentally, I added an encodeString to safeSpawn and unsafeSpawn in XMonad.Util.Run, and my searches seem to be passing through correctly.)

-- gwern
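
On the isLatin1 question, a small sketch with base only may help. isLatin1 is true for every code point up to U+00FF, so it identifies strings that survive truncation to one byte per Char; that is the right test only if the locale is Latin-1. In a UTF-8 locale, bytes 0x80 to 0xFF never stand alone, so only pure-ASCII strings can skip encoding. The function names below are hypothetical.

```haskell
import Data.Char (isAscii, isLatin1)

-- | True when truncating each Char to one byte loses no information
-- (every code point is at most U+00FF).
fitsInOneByte :: String -> Bool
fitsInOneByte = all isLatin1

-- | True when the truncated bytes are also the string's UTF-8
-- encoding, which holds exactly for pure-ASCII strings.
alreadyUtf8Safe :: String -> Bool
alreadyUtf8Safe = all isAscii
```

For "caf\233" (the word cafe with an accented e, U+00E9), fitsInOneByte is True but alreadyUtf8Safe is False: truncation would emit the lone byte 0xE9, which is not valid UTF-8 on its own.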

* Gwern Branwen
>> What's the benefit of maintaining the non-Unicode spawn behavior? Is it needlessly complex to have a spawnA and a spawnU, with the actual spawn function choosing the more appropriate one based on the string itself?
>
> How would it determine that? I don't know that Data.Char.isLatin1 would suffice.

RFC 3629 [1] states:

   o  UTF-8 strings can be fairly reliably recognized as such by a simple algorithm, i.e., the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length.

However, no reference to the algorithm itself is given.

Google brought me this sample algorithm [2]. It's probably worth implementing something like that and including it in utf8-string, if it's not already there.

1. http://www.ietf.org/rfc/rfc3629.txt
2. http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html

-- Roman I. Cheplyaka (aka Feuerbach @ IRC)
http://ro-che.info/docs/xmonad.hs

On Thursday 15 January 2009 16:53:49 Roman Cheplyaka wrote:
> Google brought me this sample algorithm [2]. It's probably worth implementing something like that and including it in utf8-string, if it's not already there.
>
> 2. http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html

Something like this? (code below) The algorithm is trivial: check for impossible byte combinations. If there are no such bytes, pairs, etc., the byte sequence is probably a UTF-8 encoded string.

But the problem is not with decoding Unicode strings, i.e. not with functions like fromUnicode :: [Word8] -> [Char], but with encoding them. A Char represents a Unicode symbol, so everything is OK at that point. However, Unix system calls know nothing about Unicode and accept (char*), or [Word8] in Haskell terminology.

The conversion from [Char] to [Word8] is the problem. It arises whenever Haskell needs to pass a string to the outside world. Currently each Char is simply truncated to one byte regardless of its value. That is why an `encode' function is needed. executeFile is not the only function affected.

import Control.Monad
import Data.Word
import Data.Bits
import Data.Maybe

is11,is10,is0x :: Word8 -> Bool
is11 b = (b `shiftR` 6) == 3
is10 b = (b `shiftR` 6) == 2
is0x b = b < 128

-- Test if a pair of adjacent bytes is allowed in a UTF8 encoded string.
validPair :: Word8 -> Word8 -> Maybe Word8
validPair a b =
    if (b < 254) && not ((is0x a && is10 b) || (is11 a && (not $ is10 b)))
        then Just b
        else Nothing

-- Check if a sequence of bytes is a UTF8 encoded string. Note that this
-- check is probabilistic. If the function returns False, the string is
-- not UTF8. If it returns True, the string may still fail to decode.
isUTF8 :: [Word8] -> Bool
isUTF8 = isJust . foldM validPair 0
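
As the comment in the code says, the pairwise check is only a heuristic; for instance, it accepts a multi-byte lead byte at the very end of the input, because no later pair ever looks at it. A somewhat stricter sketch counts the continuation bytes that each lead byte announces. It is still purely structural: it does not reject overlong three- and four-byte forms, surrogates, or code points above U+10FFFF. The names contCount, isCont and isUTF8Structural are hypothetical and not part of utf8-string.

```haskell
import Data.Word (Word8)

-- | Number of continuation bytes a lead byte announces, or Nothing
-- if the byte cannot start a sequence (a stray continuation byte,
-- 0xC0/0xC1, or anything above 0xF4).
contCount :: Word8 -> Maybe Int
contCount b
  | b < 0x80               = Just 0
  | b >= 0xC2 && b <= 0xDF = Just 1
  | b >= 0xE0 && b <= 0xEF = Just 2
  | b >= 0xF0 && b <= 0xF4 = Just 3
  | otherwise              = Nothing

-- | Continuation bytes have the bit pattern 10xxxxxx.
isCont :: Word8 -> Bool
isCont b = b >= 0x80 && b <= 0xBF

-- | Every lead byte must be followed by exactly the number of
-- continuation bytes it announces.
isUTF8Structural :: [Word8] -> Bool
isUTF8Structural []     = True
isUTF8Structural (b:bs) = case contCount b of
  Nothing -> False
  Just n  ->
    let (cs, rest) = splitAt n bs
    in  length cs == n && all isCont cs && isUTF8Structural rest
```

With this version the truncated sequence [0xD1] is rejected, as is the Latin-1 byte string [0x63, 0x61, 0x66, 0xE9], while [0xD1, 0x8F] is accepted.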

On Thu, Jan 15, 2009 at 11:04 AM, Khudyakov Alexey wrote:
> The conversion from [Char] to [Word8] is the problem. It arises whenever Haskell needs to pass a string to the outside world. Currently each Char is simply truncated to one byte regardless of its value. That is why an `encode' function is needed. executeFile is not the only function affected.

Perhaps we're over-thinking all this. Is it a problem in any way to run encodeString over a String that is just normal ASCII (that is, no funky Unicode)?

Eric: could we just mindlessly call encodeString on everything going into spawn/safeSpawn?

-- gwern

* Gwern Branwen
> Perhaps we're over-thinking all this. Is it a problem in any way to run encodeString over a String that is just normal ASCII (that is, no funky Unicode)?
>
> Eric: could we just mindlessly call encodeString on everything going into spawn/safeSpawn?

Because the user is free to have an 8-bit locale?

-- Roman I. Cheplyaka :: http://ro-che.info/
"Don't let school get in the way of your education." - Mark Twain

On 2009 Jan 15, at 12:04, Gwern Branwen wrote:
> Perhaps we're over-thinking all this. Is it a problem in any way to run encodeString over a String that is just normal ASCII (that is, no funky Unicode)?

AFAIK the only reason this is an issue at all is the desire to keep the dependencies of the core minimal; there is no technical reason otherwise for leaving the core's spawn as is. So the real question is whether we want to make the core depend on utf8-string or come up with some ugly hack that lets core use the ASCII-only version while contrib modules can use the UTF8 one.

-- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

participants (5)
- Brandon S. Allbery KF8NH
- Gwern Branwen
- Khudyakov Alexey
- Roman Cheplyaka
- Sean Escriva