Re: [Haskell-cafe] strange behavior in Text.Regex.Posix

23 Jan 2007

John MacFarlane wrote:
...
Can anyone help me understand this odd behavior in Text.Regex.Posix (GHC 6.6)?
Prelude Text.Regex.Posix Text.Regex> subRegex (mkRegex "\\^") "he\350llo" "@"
"he@llo"
Why does /\^/ match \350 here?  Generally Text.Regex.Posix seems to work
fine with unicode characters.  For example, \350 is treated as a single
character here:
Prelude Text.Regex.Posix Text.Regex> subRegex (mkRegex "e.l") "he\350llo" "@"
"h@lo"
The problem is specific to \350 and doesn't happen with, say, \351:
Prelude Text.Regex> subRegex (mkRegex "\\^") "he\351llo" "@"
"he\351llo"
Is this a bug, or just something I'm not understanding?
John
The Text.Regex API calls the regex-posix backend in Text.Regex.Posix, which hands
off the matching to the (very slow) POSIX C library.

And this library does not know Unicode from a hole in the ground -- every Char is
truncated to a single byte:

chr (ord '\350' `mod` 256) is '^'

Thus your pattern, which matches the character '^', will match '\350'.  By the
same arithmetic '\351' becomes '_' (351 `mod` 256 is 95), so it does not match.
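
To make the arithmetic concrete, here is a minimal sketch that models the
truncation in pure Haskell.  The names truncateChar and truncateString are
made up for illustration; this is not the marshalling code in regex-posix,
only a model of the behaviour described above.

```haskell
import Data.Char (chr, ord)

-- Hypothetical model: keep only the low byte of each Char's code point,
-- as happens when a Char is squeezed into a single C char.
truncateChar :: Char -> Char
truncateChar c = chr (ord c `mod` 256)

-- Apply the same truncation to a whole String.
truncateString :: String -> String
truncateString = map truncateChar
```

Loaded into GHCi, truncateString "he\350llo" evaluates to "he^llo", which is
the string the pattern "\\^" is effectively matched against, while
truncateChar '\351' evaluates to '_'.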

http://darcs.haskell.org/packages/
http://darcs.haskell.org/packages/regex-unstable/

For a full Char matching regex backend you should get regex-parsec.  The
regex-dfa backend has problems for which I have not yet uploaded the fix.

The regex-pcre backend ought to handle UTF8 -- but you have to do the
conversion to UTF8 yourself, for which Data.ByteString will come in handy.
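
As a rough sketch of what that conversion involves, the following encodes a
String to UTF8 bytes by hand using only base.  The names encodeChar and
encodeUtf8 are hypothetical helpers, not part of any regex package, and the
sketch assumes the input contains no surrogate code points.  The resulting
[Word8] could then be handed to Data.ByteString.pack.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode one Char as 1 to 4 UTF8 bytes.
-- Assumes the Char is not a surrogate (U+D800..U+DFFF).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi s   = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode a whole String.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar
```

Here '\350' becomes the two bytes 197 and 158, neither of which is the byte
for '^' (94), so a byte-oriented matcher no longer confuses the two.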

The unstable library regex-tdfa is much faster than regex-parsec and is more
POSIX compliant than regex-posix.  It should go stable within a week.

Chris Kuklewicz