Re: [xmonad] spawn functions are not unicode safe

15 Jan 2009

On Thursday 15 January 2009 16:53:49 Roman Cheplyaka wrote:
...
> RFC 3629 [1] states:
>
>    o  UTF-8 strings can be fairly reliably recognized as such by a
>       simple algorithm, i.e., the probability that a string of
>       characters in any other encoding appears as valid UTF-8 is low,
>       diminishing with increasing string length.
>
> However, no references to the algorithm itself are given.
> Google brought me this sample algorithm [2].
> Probably it's worth to implement something like that and include into
> utf8-string if it's not already there.
>
> 1. http://www.ietf.org/rfc/rfc3629.txt
> 2. http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html

Something like this? (Code below.) The algorithm is trivial: check for
impossible byte combinations. If there are no such bytes or pairs, the byte
sequence is probably a UTF-8 encoded string.

But the problem is not with decoding Unicode strings, i.e. not with functions like
fromUnicode :: [Word8] -> [Char]
but with encoding them. A Char represents a Unicode code point, so
everything is fine up to this point. However, Unix system calls know nothing
about Unicode and accept (char*), or [Word8] in Haskell terminology.

The conversion from [Char] to [Word8] is the problem. It arises whenever Haskell
needs to pass a string to the outside world. Currently each Char is simply
truncated to one byte regardless of its value. That is why an `encode' function
is needed. executeFile is not the only function affected.
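
To make the difference concrete, here is a minimal sketch of both behaviours using only base. The names truncateChar, encodeChar and encodeString are made up for illustration; they are not the xmonad or utf8-string API, and the sketch does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Roughly what truncation does: keep only the low 8 bits of the code point.
-- (Hypothetical name, for illustration only.)
truncateChar :: Char -> Word8
truncateChar = fromIntegral . ord

-- Hypothetical helper: UTF-8 encode a single Char into 1 to 4 bytes.
-- Assumption: surrogate code points are not filtered out here.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload of the lead byte: the bits above position s.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte 10xxxxxx carrying the six bits starting at position s.
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- Encode a whole String, which is what would have to happen before the
-- bytes are handed to a system call.
encodeString :: String -> [Word8]
encodeString = concatMap encodeChar
```

For example, truncateChar '\x20AC' (the euro sign) gives the single byte 0xAC, while encodeChar '\x20AC' gives [0xE2, 0x82, 0xAC].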
...
import Control.Monad
import Data.Word
import Data.Bits
import Data.Maybe
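-- Classify a byte by its top bits:
--   is11: 11xxxxxx, lead byte of a multi-byte sequence
--   is10: 10xxxxxx, continuation byte
--   is0x: 0xxxxxxx, plain ASCII
is11,is10,is0x :: Word8 -> Bool
is11 b = (b `shiftR` 6) == 3
is10 b = (b `shiftR` 6) == 2
is0x b = b < 128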
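-- Test whether a pair of adjacent bytes is allowed in a UTF-8 encoded string.
-- Returns the second byte so that foldM can carry it into the next pair.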
validPair :: Word8 -> Word8 -> Maybe Word8 
validPair a b = if (b < 254) && not ((is0x a && is10 b) ||
                                     (is11 a && (not $ is10 b)))
                then Just b
                else Nothing
-- Check whether a sequence of bytes is a UTF-8 encoded string. Note that
-- this check is probabilistic: if the function returns False, the string
-- is not UTF-8; if it returns True, the string may still fail to decode.
isUTF8 :: [Word8] -> Bool
isUTF8 = isJust . foldM validPair 0
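
The pairwise check above accepts some inputs that cannot be decoded, for instance a lead byte at the very end of the input. If that matters, a somewhat stricter variant could count the continuation bytes each lead byte announces. This is only a sketch with made-up names (expected, isContinuation, isStructuralUTF8); it still does not reject overlong three- and four-byte forms, surrogates, or code points above U+10FFFF that start with 0xF4.

```haskell
import Data.Word (Word8)

-- Number of continuation bytes a lead byte announces, or Nothing if the
-- byte cannot start a sequence. (Hypothetical helper.)
expected :: Word8 -> Maybe Int
expected b
  | b < 0x80              = Just 0   -- 0xxxxxxx: ASCII
  | b >= 0xC2 && b < 0xE0 = Just 1   -- 110xxxxx (0xC0 and 0xC1 are always overlong)
  | b >= 0xE0 && b < 0xF0 = Just 2   -- 1110xxxx
  | b >= 0xF0 && b < 0xF5 = Just 3   -- 11110xxx (lead bytes above 0xF4 exceed U+10FFFF)
  | otherwise             = Nothing  -- stray continuation byte or invalid byte

-- 10xxxxxx
isContinuation :: Word8 -> Bool
isContinuation b = b >= 0x80 && b < 0xC0

-- Structural check: every lead byte is followed by exactly the number of
-- continuation bytes it announces.
isStructuralUTF8 :: [Word8] -> Bool
isStructuralUTF8 [] = True
isStructuralUTF8 (b:bs) = case expected b of
  Nothing -> False
  Just n  ->
    let (conts, rest) = splitAt n bs
    in length conts == n && all isContinuation conts && isStructuralUTF8 rest
```

With this, isStructuralUTF8 [0xC3] is False (truncated sequence), while isStructuralUTF8 [0xC3, 0xA9] is True.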

Khudyakov Alexey