
On Thu, Aug 23, 2012 at 08:59:52AM -0300, José Romildo Malaquias wrote:
Hello.
I think I have an explanation for the problem with regex-pcre, ghc-7.4.2 and UTF-8 strings.
The Text.Regex.PCRE.String module uses withCString and withCStringLen from the Foreign.C.String module to pass Haskell strings to the pcre C library functions that compile regular expressions and execute them to match some text.
Recent versions of ghc have withCString and withCStringLen definitions that use the current system locale to determine how a Haskell string is marshalled into a NUL-terminated C string in temporary storage.
With a UTF-8 locale, the C string will be longer than the corresponding Haskell string whenever the string contains characters outside the ASCII range, because each such character is encoded as more than one byte. Therefore the positions of corresponding characters in the two strings do not match.
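To make the length difference concrete, here is a minimal sketch that marshals the text from this report with an explicitly chosen encoding instead of the locale. It uses withCStringLen from GHC.Foreign, which takes the encoding as an argument; the names marshalledLength and sampleText are made up for this illustration and are not part of regex-pcre.

```haskell
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (TextEncoding, latin1, utf8)

-- | Number of bytes a Haskell string occupies once it is marshalled
-- with the given encoding (hypothetical helper for this illustration).
marshalledLength :: TextEncoding -> String -> IO Int
marshalledLength enc s = GHC.withCStringLen enc s (return . snd)

-- | The text used in this report, with the accented i written as \237.
sampleText :: String
sampleText = "pa\237s:Bras\237lia:Brasil"
```

With these definitions, marshalledLength utf8 sampleText gives 22 and marshalledLength latin1 sampleText gives 20, while length sampleText is 20. The two extra bytes come from the two accented characters.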
The regex-pcre functions compute matching positions on the C strings, so those positions count bytes. But to compute the matching strings they use those positions with the Haskell strings, where they count characters.
That gives the mismatch shown earlier and repeated here with the attached program run on a system with a UTF-8 locale:
$ LANG=en_US.UTF-8 && ./test1
getForeignEncoding: UTF-8
regex            : país:(.*):(.*)
text             : país:Brasília:Brasil
String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
String match     : [["pa\237s:Bras\237lia:Brasil","ras\237lia:B","asil"]]

$ LANG=en_US.ISO-8859-1 && ./test1
getForeignEncoding: ISO-8859-1
regex            : país:(.*):(.*)
text             : país:Brasília:Brasil
String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
String match     : [["pa\237s:Bras\237lia:Brasil","Bras\237lia","Brasil"]]

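The wrong substrings in the UTF-8 run can be reproduced without pcre at all. The sketch below applies the (offset, length) pairs printed above to the Haskell string, counting characters. sliceChars is a hypothetical stand-in for what the matching functions do with the positions, not the actual regex-pcre code.

```haskell
-- | Apply an (offset, length) pair to a Haskell string, counting characters.
-- Hypothetical stand-in for how the positions are used, not regex-pcre code.
sliceChars :: (Int, Int) -> String -> String
sliceChars (off, len) = take len . drop off

-- | The text used in this report, with the accented i written as \237.
sampleText :: String
sampleText = "pa\237s:Bras\237lia:Brasil"

-- | Positions of groups 1 and 2, copied from the UTF-8 run (byte offsets).
utf8Spans :: [(Int, Int)]
utf8Spans = [(6, 9), (16, 6)]

-- | Positions of groups 1 and 2, copied from the ISO-8859-1 run.
latin1Spans :: [(Int, Int)]
latin1Spans = [(5, 8), (14, 6)]
```

Mapping sliceChars over utf8Spans yields "ras\237lia:B" and "asil", the same wrong groups as in the UTF-8 run. With latin1Spans, where one byte is one character, it yields "Bras\237lia" and "Brasil".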
I see two ways of fixing this bug:
1. make the matching functions compute the text using the C string and the positions calculated by the C function, and convert the text back to a Haskell string.
2. map the positions in the C string (if possible) to the corresponding positions in the Haskell string; this way the current definitions of the matching functions returning text will just work.
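Both options can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch itself: it assumes the C string is UTF-8 (a real fix would have to follow the foreign encoding actually in use), and the names sliceViaCString, utf8Width, byteToCharOffset and byteToCharSpan are made up for this example.

```haskell
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (utf8)
import Foreign.Ptr (plusPtr)
import Data.Char (ord)

-- Option 1: slice the marshalled bytes, then decode the slice back into a
-- Haskell string. Assumes the span begins and ends on character boundaries.
sliceViaCString :: (Int, Int) -> String -> IO String
sliceViaCString (off, len) s =
  GHC.withCStringLen utf8 s $ \(ptr, _) ->
    GHC.peekCStringLen utf8 (ptr `plusPtr` off, len)

-- Number of bytes one character occupies in UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Option 2: translate a byte offset into a character offset.
-- Nothing means the offset is out of range or falls inside a character.
byteToCharOffset :: String -> Int -> Maybe Int
byteToCharOffset s target = go 0 0 s
  where
    go bytes chars rest
      | bytes == target = Just chars
      | bytes > target  = Nothing
      | otherwise = case rest of
          []     -> Nothing
          (c:cs) -> go (bytes + utf8Width c) (chars + 1) cs

-- Translate a (byte offset, byte length) pair into character units.
byteToCharSpan :: String -> (Int, Int) -> Maybe (Int, Int)
byteToCharSpan s (off, len) = do
  start <- byteToCharOffset s off
  end   <- byteToCharOffset s (off + len)
  return (start, end - start)
```

For the text "pa\237s:Bras\237lia:Brasil", byteToCharSpan maps (6,9) to Just (5,8) and (16,6) to Just (14,6), which are the positions reported in the ISO-8859-1 run. The Maybe result covers the "if possible" caveat in the second option.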
I hope this helps in fixing the issue.
I have a fix for this bug, and it would be nice if others could take a look at it and see if it is ok. It is based on the second way presented above.

Romildo