Re: [Haskell-cafe] regex-pcre is not working with UTF-8

On Tue, Aug 21, 2012 at 05:50:44PM -0300, José Romildo Malaquias wrote:
On Tue, Aug 21, 2012 at 04:05:28PM +0100, Chris Kuklewicz wrote:
I do not have time to test this myself right now. But I will unravel my code a bit for you.
As of November 2011 it worked without problems in my application. Now that I have resumed developing the application, I am faced with this behaviour. Since it used to work before, I believe it is a bug in regex-pcre or libpcre.
I believe it may be a problem in String <-> ByteString conversion. The "base" library may have changed, and your LOCALE information may be different or may be used differently by "base".
The (temporary) workaround I found is to convert the strings to byte-strings before matching, and then convert the results back to strings. With byte-strings it works well.
That is an excellent sign that it is your LOCALE settings being picked up by GHC's "base" package, see explanation below. [...] I have written an application to test those things. There are 2 source files: test.hs and seestr.c, which are attached.
The test does the following:
1. shows the getForeignEncoding
2. uses a C function to show the characters from a String (using withCString) and from a ByteString (using useAsCString)
3. matches a PCRE regular expression using String and ByteString
The test is run twice, with different LANG settings, and its output follows. [...] As can be seen, regular expression matching does not work with en_US.UTF-8. But it works with en_US.ISO-8859-1.
The test shows that withCString is working as expected too. This may suggest the problem is really with regex-pcre.
The previous tests were run on a Gentoo Linux system with ghc-7.4.1. I have also run the tests on Fedora 17 with ghc-7.0.4, which does not have the bug. The sources are attached. The test output follows:

$ LANG=en_US.ISO-8859-1 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
regex : pa�s:(.*)
text : pa�s:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]

$ LANG=en_US.UTF-8 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
regex : país:(.*)
text : país:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]

Clearly withCString has changed from ghc-7.0.4 to ghc-7.4.1. It seems that with ghc-7.0.4 withCString does not obey the UTF-8 locale and generates a latin1 C string.

Regards,
Romildo
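To make the encoding dependence concrete, here is a minimal sketch (not the attached test.hs) that marshals a String with an explicitly chosen encoding and returns the bytes the C side would see. It assumes a recent "base" that provides the GHC.Foreign module; the helper name marshalledBytes is made up for this illustration.

```haskell
import qualified GHC.Foreign as GF
import System.IO (TextEncoding, utf8, latin1)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (castPtr)
import Data.Word (Word8)

-- Hypothetical helper: marshal a String using the given encoding and
-- read back the raw bytes of the temporary C buffer.
marshalledBytes :: TextEncoding -> String -> IO [Word8]
marshalledBytes enc s =
  GF.withCStringLen enc s $ \(ptr, len) ->
    peekArray len (castPtr ptr)
```

With latin1, "pa\237s" should come out as four bytes (70 61 ed 73), matching the ghc-7.0.4 output above. With utf8 the same String becomes five bytes, because the character \237 is encoded as the two bytes c3 ad.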

Hello.

I think I have an explanation for the problem with regex-pcre, ghc-7.4.2 and UTF-8 strings.

The Text.Regex.PCRE.String module uses the withCString and withCStringLen functions from the module Foreign.C.String to pass a Haskell string to the C library pcre functions that compile regular expressions, and execute regular expressions to match some text.

Recent versions of ghc have withCString and withCStringLen definitions that use the current system locale to define the marshalling of a Haskell string into a NUL-terminated C string using temporary storage.

With a UTF-8 locale the length of the C string will be greater than the length of the corresponding Haskell string in the presence of characters outside of the ASCII range. Therefore positions of corresponding characters in both strings do not match.

In order to compute matching positions, regex-pcre functions use C strings. But to compute matching strings they use those positions with Haskell strings.

That gives the mismatch shown earlier and repeated here with the attached program run on a system with a UTF-8 locale:

$ LANG=en_US.UTF-8 && ./test1
getForeignEncoding: UTF-8
regex : país:(.*):(.*)
text : país:Brasília:Brasil
String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
String match : [["pa\237s:Bras\237lia:Brasil","ras\237lia:B","asil"]]

$ LANG=en_US.ISO-8859-1 && ./test1
getForeignEncoding: ISO-8859-1
regex : pa�s:(.*):(.*)
text : pa�s:Bras�lia:Brasil
String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
String match : [["pa\237s:Bras\237lia:Brasil","Bras\237lia","Brasil"]]

I see two ways of fixing this bug:

1. make the matching functions compute the text using the C string and the positions calculated by the C function, and convert the text back to a Haskell string.
2. map the positions in the C string (if possible) to the corresponding positions in the Haskell string; this way the current definitions of the matching functions returning text will just work.

I hope this helps in fixing the issue.

Regards,
Romildo

On Thu, Aug 23, 2012 at 08:59:52AM -0300, José Romildo Malaquias wrote:
Hello.
I think I have an explanation for the problem with regex-pcre, ghc-7.4.2 and UTF-8 strings.
The Text.Regex.PCRE.String module uses the withCString and withCStringLen functions from the module Foreign.C.String to pass a Haskell string to the C library pcre functions that compile regular expressions, and execute regular expressions to match some text.
Recent versions of ghc have withCString and withCStringLen definitions that use the current system locale to define the marshalling of a Haskell string into a NUL-terminated C string using temporary storage.
With a UTF-8 locale the length of the C string will be greater than the length of the corresponding Haskell string in the presence of characters outside of the ASCII range. Therefore positions of corresponding characters in both strings do not match.
In order to compute matching positions, regex-pcre functions use C strings. But to compute matching strings they use those positions with Haskell strings.
That gives the mismatch shown earlier and repeated here with the attached program run on a system with a UTF-8 locale:
$ LANG=en_US.UTF-8 && ./test1
getForeignEncoding: UTF-8
regex : país:(.*):(.*)
text : país:Brasília:Brasil
String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
String match : [["pa\237s:Bras\237lia:Brasil","ras\237lia:B","asil"]]

$ LANG=en_US.ISO-8859-1 && ./test1
getForeignEncoding: ISO-8859-1
regex : pa�s:(.*):(.*)
text : pa�s:Bras�lia:Brasil
String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
String match : [["pa\237s:Bras\237lia:Brasil","Bras\237lia","Brasil"]]
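The wrong substrings in the UTF-8 run can be reproduced without pcre at all. The sketch below is only an illustration, not code from regex-pcre: it takes the (offset, length) pairs printed by matchOnce above and applies them to the Haskell String by counting characters. The names sliceChars, sampleText and groupsWith are invented for this example.

```haskell
-- Slice a Haskell String with an (offset, length) pair, counting characters.
sliceChars :: (Int, Int) -> String -> String
sliceChars (off, len) = take len . drop off

-- The subject text from the test run; \237 is the character 'í'.
sampleText :: String
sampleText = "pa\237s:Bras\237lia:Brasil"

-- Offsets copied from the matchOnce output above.
utf8Offsets, latin1Offsets :: [(Int, Int)]
utf8Offsets   = [(0, 22), (6, 9), (16, 6)]   -- byte positions in the UTF-8 C string
latin1Offsets = [(0, 20), (5, 8), (14, 6)]   -- one byte per character

-- Apply each offset pair to the Haskell String.
groupsWith :: [(Int, Int)] -> [String]
groupsWith = map (`sliceChars` sampleText)
```

Here groupsWith utf8Offsets yields the shifted groups "ras\237lia:B" and "asil" seen in the UTF-8 run, while groupsWith latin1Offsets yields "Bras\237lia" and "Brasil", because under latin1 byte positions and character positions coincide.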
I see two ways of fixing this bug:
1. make the matching functions compute the text using the C string and the positions calculated by the C function, and convert the text back to a Haskell string.
2. map the positions in the C string (if possible) to the corresponding positions in the Haskell string; this way the current definitions of the matching functions returning text will just work.
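As an illustration of the second option, here is a rough sketch of mapping byte positions in a UTF-8 C string back to character positions in the Haskell String. It is not the actual patch; it assumes the C string is plain UTF-8 (1 to 4 bytes per character), that offsets fall on character boundaries, and it ignores unset groups, which pcre reports as -1. All function names are made up for this example.

```haskell
import Data.Char (ord)

-- Number of bytes a character occupies in UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Map a byte offset in the UTF-8 encoding of s to a character offset in s.
-- 'starts' lists the byte offset at which each character begins, followed
-- by the total byte length, so counting the entries below the target
-- offset gives the character index.
byteToCharOffset :: String -> Int -> Int
byteToCharOffset s byteOff = length (takeWhile (< byteOff) starts)
  where starts = scanl (+) 0 (map utf8Width s)

-- Convert a (byte offset, byte length) pair to (char offset, char length).
byteToCharRange :: String -> (Int, Int) -> (Int, Int)
byteToCharRange s (off, len) = (cOff, cEnd - cOff)
  where
    cOff = byteToCharOffset s off
    cEnd = byteToCharOffset s (off + len)
```

For the text "país:Brasília:Brasil", byteToCharRange turns the UTF-8 pairs (6,9) and (16,6) into (5,8) and (14,6), which are the positions reported in the ISO-8859-1 run. Recomputing the prefix sums for every offset is quadratic in the worst case; a real fix would probably build the table once per subject string.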
I hope this helps in fixing the issue.
I have a fix for this bug, and it would be nice if others took a look at it to see whether it is OK. It is based on the second way presented above.

Romildo