
On Tue, Aug 21, 2012 at 05:50:44PM -0300, José Romildo Malaquias wrote:
On Tue, Aug 21, 2012 at 04:05:28PM +0100, Chris Kuklewicz wrote:
I do not have time to test this myself right now. But I will unravel my code a bit for you.
By November 2011 it worked without problems in my application. Now that I have resumed developing the application, I have been faced with this behaviour. As it used to work before, I believe it is a bug in regex-pcre or libpcre.
I believe it may be a problem in String <-> ByteString conversion. The "base" library may have changed and your LOCALE information may be different or may be being used differently by "base".
The (temporary) workaround I found is to convert the strings to byte-strings before matching, and then convert the results back to strings. With byte-strings it works well.
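The workaround described above can be sketched roughly as follows. This is only an illustration, not the code from the application: `ByteMatcher`, `matchViaBytes` and `prefixMatcher` are hypothetical names, and `prefixMatcher` is a toy stand-in for the real matcher so that the sketch runs with only the bytestring package. With regex-pcre one would presumably pass something like `\r t -> t =~ r` instead. Note that `B.pack` truncates each Char to 8 bits, so it behaves like a Latin-1 encoder regardless of the locale.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical matcher type: regex first, then the text to search.
type ByteMatcher = B.ByteString -> B.ByteString -> [[B.ByteString]]

-- Convert both arguments to ByteString, match, and convert the
-- results back to String. Only safe for characters up to '\255'.
matchViaBytes :: ByteMatcher -> String -> String -> [[String]]
matchViaBytes matcher regex text =
  map (map B.unpack) (matcher (B.pack regex) (B.pack text))

-- Toy stand-in for a regex engine: treats the "regex" as a literal
-- prefix and captures whatever follows it.
prefixMatcher :: ByteMatcher
prefixMatcher prefix text
  | prefix `B.isPrefixOf` text = [[text, B.drop (B.length prefix) text]]
  | otherwise                  = []

main :: IO ()
main = print (matchViaBytes prefixMatcher "pa\237s:" "pa\237s:Brasil")
-- prints [["pa\237s:Brasil","Brasil"]]
```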
That is an excellent sign that it is your LOCALE settings being picked up by GHC's "base" package, see explanation below.
[...]
I have written an application to test those things. There are 2 source files: test.hs and seestr.c, which are attached.
The test does the following:
1. shows the getForeignEncoding
2. uses a C function to show the characters from a String (using withCString) and from a ByteString (using useAsCString)
3. matches a PCRE regular expression using String and ByteString
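Steps 1 and 2 can be approximated without the C helper by peeking at the marshalled bytes directly from Haskell. This is a sketch under assumptions, not the attached test.hs: `stringBytes` and `byteStringBytes` are hypothetical names, and it needs base >= 4.5 (ghc-7.4) for `getForeignEncoding`. The bytes reported for the String depend on the foreign encoding in effect, while the ByteString bytes do not.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Word (Word8)
import Foreign.C.String (withCStringLen)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (castPtr)
import GHC.IO.Encoding (getForeignEncoding)
import Numeric (showHex)

-- Bytes produced when a String is marshalled to a C string
-- (uses the current foreign encoding, like withCString).
stringBytes :: String -> IO [Word8]
stringBytes s = withCStringLen s $ \(ptr, len) ->
  peekArray len (castPtr ptr)

-- Bytes handed to C for a ByteString (copied verbatim, no encoding step).
byteStringBytes :: B.ByteString -> IO [Word8]
byteStringBytes bs = B.useAsCStringLen bs $ \(ptr, len) ->
  peekArray len (castPtr ptr)

main :: IO ()
main = do
  enc <- getForeignEncoding
  putStrLn ("foreign encoding: " ++ show enc)
  let s = "pa\237s"
  sb <- stringBytes s
  bb <- byteStringBytes (B.pack s)
  putStrLn ("String bytes:     " ++ unwords (map (`showHex` "") sb))
  putStrLn ("ByteString bytes: " ++ unwords (map (`showHex` "") bb))
```

Running it under different LANG settings, as the test does, should show whether the two byte sequences agree for a given locale.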
The test is run twice, with different LANG settings, and its output follows. [...] As can be seen, regular expression matching does not work with en_US.UTF-8. But it works with en_US.ISO-8859-1.
The test shows that withCString is working as expected too. This may suggest the problem is really with regex-pcre.
The previous tests were run on Gentoo Linux with ghc-7.4.1. I have also run the tests on Fedora 17 with ghc-7.0.4, which does not have the bug. The sources are attached. The test output follows:

$ LANG=en_US.ISO-8859-1 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
regex : pa�s:(.*)
text : pa�s:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]

$ LANG=en_US.UTF-8 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
regex : país:(.*)
text : país:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]

Clearly withCString has changed from ghc-7.0.4 to ghc-7.4.1. It seems that with ghc-7.0.4 withCString does not obey the UTF-8 locale and generates a latin1 C string.

Regards,
Romildo