
On Thursday 15 January 2009 16:53:49 Roman Cheplyaka wrote:
RFC 3629 [1] states:
o UTF-8 strings can be fairly reliably recognized as such by a simple algorithm, i.e., the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length.
However, no references to the algorithm itself are given.
Google turned up this sample algorithm [2]. It is probably worth implementing something like that and including it in utf8-string, if it's not already there.
1. http://www.ietf.org/rfc/rfc3629.txt
2. http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html
Something like this? (code below) The algorithm is trivial: check for impossible byte combinations. If there are no such bytes or pairs, the byte sequence is probably a UTF-8 encoded string.

But the problem is not with decoding Unicode strings, i.e. not with functions like fromUnicode :: [Word8] -> [Char], but with encoding them. A Char represents a Unicode code point, so everything is fine up to that point. However, Unix system calls know nothing about Unicode and accept (char*), or [Word8] in Haskell terminology, and the conversion from [Char] to [Word8] is the problem. It arises whenever Haskell needs to pass a string to the outside world. Currently each Char is simply truncated to one byte regardless of its value. That is why an `encode' function is needed. executeFile is not the only function affected.
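To make the encoding side concrete before the validity check, here is a minimal sketch of what such an `encode' function could look like. The helper name encodeChar is made up for illustration, and this is not claimed to be the implementation used by utf8-string or any other library. It assumes the input contains no surrogate code points (it would emit three bytes for them rather than reject them).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper (not from any library): UTF-8 encode one Char.
-- Assumption: the Char is not a surrogate; no check is made for that.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]                              -- 1 byte, ASCII
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                       -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]              -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]     -- 4 bytes
  where
    n = ord c
    -- Payload bits that go into the lead byte.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode a whole string instead of truncating each Char to one byte.
encode :: String -> [Word8]
encode = concatMap encodeChar
```

With something like this, a [Char] would be passed through encode before it reaches a system call, so a Char above 255 becomes several bytes instead of losing its high bits.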
import Control.Monad
import Data.Word
import Data.Bits
import Data.Maybe

-- Classify a byte by its two top bits: 11xxxxxx (lead byte),
-- 10xxxxxx (continuation byte), 0xxxxxxx (ASCII).
is11, is10, is0x :: Word8 -> Bool
is11 b = (b `shiftR` 6) == 3
is10 b = (b `shiftR` 6) == 2
is0x b = b < 128

-- Test whether a pair of adjacent bytes is allowed in a UTF-8 encoded string.
validPair :: Word8 -> Word8 -> Maybe Word8
validPair a b =
    if (b < 254) && not ((is0x a && is10 b) || (is11 a && (not $ is10 b)))
    then Just b
    else Nothing

-- Check whether a sequence of bytes is a UTF-8 encoded string. Note that
-- this check is probabilistic: if the function returns False, the string
-- is not UTF-8; if it returns True, the string may still fail to decode.
isUTF8 :: [Word8] -> Bool
isUTF8 = isJust . foldM validPair 0
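
As the comment on isUTF8 says, a True result does not guarantee that the bytes decode. For example, the pairwise check accepts a string that ends in a lead byte with no continuation byte after it. A somewhat stricter structural check, sketched below under made-up names (followers, isCont, isStructuralUTF8), counts the continuation bytes each lead byte announces. It is still not a full validator: it does not reject overlong three- and four-byte forms, surrogates, or code points above U+10FFFF.

```haskell
import Data.Word (Word8)

-- Hypothetical helper: how many continuation bytes a lead byte announces,
-- or Nothing if the byte cannot start a sequence.
followers :: Word8 -> Maybe Int
followers b
  | b < 0x80              = Just 0   -- ASCII
  | b >= 0xC2 && b < 0xE0 = Just 1   -- 0xC0 and 0xC1 would only give overlong forms
  | b >= 0xE0 && b < 0xF0 = Just 2
  | b >= 0xF0 && b < 0xF5 = Just 3
  | otherwise             = Nothing  -- continuation byte or 0xF5..0xFF

-- A continuation byte has the form 10xxxxxx.
isCont :: Word8 -> Bool
isCont b = b >= 0x80 && b < 0xC0

-- Structural check only: every lead byte must be followed by exactly the
-- number of continuation bytes it announces.
isStructuralUTF8 :: [Word8] -> Bool
isStructuralUTF8 [] = True
isStructuralUTF8 (b:bs) =
  case followers b of
    Nothing -> False
    Just n  ->
      let (cs, rest) = splitAt n bs
      in length cs == n && all isCont cs && isStructuralUTF8 rest
```

The trade-off is the same as before: a False result means the bytes are not UTF-8, while a True result only means the byte structure is plausible.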