Re: Bug in Text.Regex.PCRE - do not accept national symbol in pattern

Alexandr, Thanks for sending me this question about unicode and regex-pcre. I will share with the mailing list. This is an encoding issue. From the haddock documentation for regex-pcre: http://hackage.haskell.org/packages/archive/regex-pcre/0.94.1/doc/html/Text-... "Using the provided CompOption and ExecOption values and if configUTF8 is True, then you might be able to send UTF8 encoded ByteStrings to PCRE and get sensible results. This is currently untested." This is a literate Haskell post so you can save with file extension ".lhs" and pass this to ghci. The answer is a combination of "man 3 pcre" and the haddock documentation for haskell-pcre and using makeRegexOpts. I show one possible way to use utf8 below, via the 'utf8-string' package from hackage. There are other ways to use the same package and other packages available.
{-# LANGUAGE FlexibleContexts #-} import Text.Regex.PCRE hiding ((=~)) --import Text.Regex.PCRE.Wrap(configUtf8) import qualified Data.ByteString.UTF8 as U import qualified System.IO.UTF8 as U import Data.Bits((.|.))
Here I copied the original source for (=~) from http://hackage.haskell.org/packages/archive/regex-pcre/0.94.1/doc/html/Text-... I then editied it to create a custom (=~) that defines its own options. You can add compNoUTF8Check for performance/safety tradeoff (see man 3 pcre).
makeRegexUtf8 :: (RegexMaker Regex CompOption ExecOption source) => source -> Regex makeRegexUtf8 r = let co = defaultCompOpt .|. compUTF8 -- need compUTF8 flag when using makeRegexOpts -- co = defaultCompOpt .|. compUTF8 .|. compNoUTF8Check -- in makeRegexOpts co defaultExecOpt r
(=~) :: (RegexMaker Regex CompOption ExecOption source,RegexContext Regex source1 target) => source1 -> source -> target (=~) x r = let q = makeRegexUtf8 r in match q x
If you are going to use the same pattern against many different texts then you should NOT use (=~). Instead you should call makeRegexUtf8 and reuse the resulting Regex value. Otherwise you have to recompile the pattern for each match performed. Below, 're_test' was changed internally to convert the [Char] into a ByteString holding a utf8 encoded representation. The 'makeRegexOpts' and 'match' calls will then run the libpcre routines directly on the the memory that backs the ByteString. This is an optimal was to use the library.
re_test :: String -> String -> Bool re_test re str = (U.fromString str) =~ (U.fromString re)
-- test for national symbols main = do putStrLn $ "If this line ends with True then your libpcre has UTF8 support: " ++ show configUTF8 let pattern1,pattern2,pattern3,text :: String pattern1 = "^п.*" pattern2 = "^..ив.*" pattern3 = "^......$" text = "привет" U.putStrLn $ "The 3 patterns are: " ++ pattern1 ++ ", " ++ pattern2 ++ ", and "++pattern3 U.putStrLn $ "The text to be matched is " ++ text putStrLn $ "The length of the text to be matched is "++show (length text) putStrLn "All three lines below should print True" print $ re_test pattern1 text print $ re_test pattern2 text print $ re_test pattern3 text
The output when I run this on my machine is If this line ends with True then your libpcre has UTF8 support: True The 3 patterns are: ^п.*, ^..ив.*, and ^......$ The text to be matched is привет The length of the text to be matched is 6 All three lines below should print True True True True
participants (1)
-
ChrisK