Re: [Haskell-cafe] regex-pcre is not working with UTF-8

22 Aug 2012

      On Tue, Aug 21, 2012 at 05:50:44PM -0300, José Romildo Malaquias wrote:
...
On Tue, Aug 21, 2012 at 04:05:28PM +0100, Chris Kuklewicz wrote:
...
I do not have time to test this myself right now.  But I will unravel my code a
bit for you.
...
By November 2011 it worked without problems in my application. Now that
I have resumed developing the application, I have been faced with this
behaviour. As it used to work before, I believe it is a bug in
regex-pcre or libpcre.
I believe it may be a problem in the String <-> ByteString conversion.  The "base"
library may have changed and your LOCALE information may be different or may be
being used differently by "base".
...
The (temporary) workaround I found is to convert the strings to
byte-strings before matching, and then convert the results back to
strings. With byte-strings it works well.
That is an excellent sign that it is your LOCALE settings being picked up by
GHC's "base" package, see explanation below.
[...]
I have written an application to test those things. There are 2 source
files: test.hs and seestr.c, which are attached.
The test does the following:
1. shows the getForeignEncoding
2. uses a C function to show the characters from a String (using
      withCString) and from a ByteString (using useAsCString)
3. matches a PCRE regular expression using String and ByteString
The test is run twice, with different LANG settings, and its output
follows.
[...]
As can be seen, regular expression matching does not work with
en_US.UTF-8. But it works with en_US.ISO-8859-1.
The test shows that withCString is working as expected too. This
may suggest the problem is really with regex-pcre.
The previous tests were run on a Gentoo Linux system with ghc-7.4.1.

I have also run the tests on Fedora 17 with ghc-7.0.4, which does not
have the bug. The sources are attached. The test output follows:

   $ LANG=en_US.ISO-8859-1 && ./test 
   testing with String
   code:       70, char: p
   code:       61, char: a
   code: ffffffed, char: 
   code:       73, char: s
   result: 4

   testing with ByteString
   code:       70, char: p
   code:       61, char: a
   code: ffffffed, char: 
   code:       73, char: s
   result: 4

   regex            : país:(.*)
   text             : país:Brasil
   String match     : [["pa\237s:Brasil","Brasil"]]
   ByteString match : [["pa\237s:Brasil","Brasil"]]

   $ LANG=en_US.UTF-8 && ./test
   testing with String
   code:       70, char: p
   code:       61, char: a
   code: ffffffed, char: 
   code:       73, char: s
   result: 4

   testing with ByteString
   code:       70, char: p
   code:       61, char: a
   code: ffffffed, char: 
   code:       73, char: s
   result: 4

   regex            : país:(.*)
   text             : país:Brasil
   String match     : [["pa\237s:Brasil","Brasil"]]
   ByteString match : [["pa\237s:Brasil","Brasil"]]

Clearly withCString has changed from ghc-7.0.4 to ghc-7.4.1. It seems
that with ghc-7.0.4 withCString does not obey the UTF-8 locale and
generates a latin1 C string.

Regards,

Romildo