
John MacFarlane wrote:
Can anyone help me understand this odd behavior in Text.Regex.Posix (GHC 6.6)?
Prelude Text.Regex.Posix Text.Regex> subRegex (mkRegex "\\^") "he\350llo" "@"
"he@llo"
Why does /\^/ match \350 here? Generally, Text.Regex.Posix seems to work fine with Unicode characters. For example, \350 is treated as a single character here:
Prelude Text.Regex.Posix Text.Regex> subRegex (mkRegex "e.l") "he\350llo" "@"
"h@lo"
The problem is specific to \350 and doesn't happen with, say, \351:
Prelude Text.Regex> subRegex (mkRegex "\\^") "he\351llo" "@"
"he\351llo"
Is this a bug, or just something I'm not understanding?
John
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

The Text.Regex API calls the regex-posix backend in Text.Regex.Posix, which hands the matching off to the (very slow) POSIX C library. That library does not know Unicode from a hole in the ground: every Char is truncated to a single byte, so

chr (ord '\350' `mod` 256) is '^'

Thus your pattern, which matches the character '^', also matches '\350'. By the same arithmetic '\351' is truncated to '_', which is why it is left alone.

http://darcs.haskell.org/packages/
http://darcs.haskell.org/packages/regex-unstable/

For a regex backend that matches full Char values, you should get regex-parsec. The regex-dfa backend has problems for which I have not yet uploaded the fix. The regex-pcre backend ought to handle UTF-8, but you have to do the conversion to UTF-8 yourself, for which Data.ByteString will come in handy.

The unstable library regex-tdfa is much faster than regex-parsec and is more POSIX compliant than regex-posix. It should go stable within a week.
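
To make the byte truncation described in this reply concrete, here is a minimal sketch that uses only Data.Char. The helper truncateToByte is a hypothetical name introduced for illustration; it is not part of regex-posix, and it only models the "mod 256" arithmetic given above, not the library's actual marshalling code.

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper (not a regex-posix function): keeps only the low
-- 8 bits of a Char's code point, as in the "mod 256" arithmetic above.
truncateToByte :: Char -> Char
truncateToByte c = chr (ord c `mod` 256)

main :: IO ()
main = do
  -- In a Haskell literal, '\350' is the decimal code point 350,
  -- and 350 - 256 = 94, which is the code point of '^'.
  print (ord '\350', ord (truncateToByte '\350'))  -- (350,94)
  print (truncateToByte '\350' == '^')             -- True
  -- '\351' lands on 95, which is '_', so a pattern for '^' ignores it.
  print (truncateToByte '\351')                    -- '_'
  -- What the C library would effectively see for the original input.
  print (map truncateToByte "he\350llo")           -- "he^llo"
```

Under this model the input string looks like "he^llo" by the time it is matched, which is consistent with the substitution result "he@llo" reported in the question.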