
I tried to write a program using Text.Regex.PCRE to search through a UTF8-encoded document. It appears that the presence of non-breaking-space characters (code point 160) triggers some weird behavior in my program.

This is using the Debian stable (Jessie) packages of ghc 7.6.3 and libraries.

Now I find myself at a fork in the road, not sure which direction to head in. Do I:
1) Continue looking (or get help with looking) for bugs in my code? (I have this reduced to a pretty small test case)
2) Assemble a bug-report against Debian?
3) Assemble a bug-report against Text.Regex.PCRE (or Text.Regex.Base) for "upstream"?
4) Uninstall Text.Regex.PCRE (and/or some other packages) and switch to something that works with Unicode/UTF8?

Any ideas?

On Wed, Sep 07, 2016 at 09:21:43PM -0400, Brian Sammon wrote:
> I tried to write a program using Text.Regex.PCRE to search through a UTF8-encoded document. It appears that the presence of non-breaking-space characters (code point 160) triggers some weird behavior in my program.
> This is using the Debian stable (Jessie) packages of ghc 7.6.3 and libraries.
> Now I find myself at a fork in the road, not sure which direction to head in.
> Do I:
> 1) Continue looking (or get help with looking) for bugs in my code? (I have this reduced to a pretty small test case)
> 2) Assemble a bug-report against Debian?
> 3) Assemble a bug-report against Text.Regex.PCRE (or Text.Regex.Base) for "upstream"?
> 4) Uninstall Text.Regex.PCRE (and/or some other packages) and switch to something that works with Unicode/UTF8?

I am pretty sure pcre-light has a UTF-8 mode. Is it feasible to swap the two modules and check whether the bug persists?

Hi Brian,
> I tried to write a program using Text.Regex.PCRE to search through a UTF8-encoded document. It appears that the presence of non-breaking-space characters (code point 160) triggers some weird behavior in my program.

I seem to recall that regex-pcre simply binds to the system's pcre library and effectively lets that library do all the work. Now, libpcre has full Unicode support, but that needs to be enabled at compile time to be available. I believe "--enable-unicode-properties" is the appropriate configure flag, but I don't know for sure.

Anyway, my point is that your system's libpcre may or may not have that feature enabled. If it does not, then regex-pcre won't be able to deal with Unicode characters properly and that issue should be reported to Debian. If your system library *has* Unicode support, then this issue might be caused by a bug in regex-pcre (unlikely) or in your code that uses it (more likely).

I hope this helps,
Peter

On 08.09.2016 at 08:42, Peter Simons wrote:
> Anyway, my point is that your system's libpcre may or may not have that feature enabled. If it does not, then regex-pcre won't be able to deal with Unicode characters properly and that issue should be reported to Debian.

I'd be surprised if that's what's happening; Debian has been all-Unicode for quite a while now, and PCRE libraries are high-profile enough that lack of Unicode support would have been reported long ago. I dimly recall that there are multiple PCRE libraries in Debian, with subtle differences. I could imagine that one of them operates in 8-bit mode by default and wants a flag somewhere to activate Unicode processing.

I have a test-case here if anyone would like to look at it and give me a "Works For Me" or "Your code's wrong" or something. The program should (i.e. I expect it to) report that it found "Page Title", but in my case, it says it found "age Title'".

On Thu, Sep 08, 2016 at 04:53:13AM -0400, Brian Sammon wrote:
> I have a test-case here if anyone would like to look at it and give me a "Works For Me" or "Your code's wrong" or something.
> The program should (i.e. I expect it to) report that it found "Page Title", but in my case, it says it found "age Title'".

$ runghc test.hs
title is |Page Title|

Debian stable (Jessie). Maybe locale is interfering with something? Mine is

LANG=en_GB.UTF-8
LANGUAGE=en_GB:en

On Thu, Sep 08, 2016 at 05:14:11AM -0400, Brian Sammon wrote:
> On Thu, 8 Sep 2016 10:52:46 +0200 Francesco Ariis wrote:
> > $ runghc test.hs
> > title is |Page Title|
> > Debian stable (Jessie). Maybe locale is interfering with something?
>
> Hmm... are you running debian-packaged ghc 7.6.3, as I am?

No, GHC 8.0.1, regex-pcre-0.94.4.

On Thu, Sep 08, 2016 at 11:36:58AM +0200, Francesco Ariis wrote:
> No, GHC 8.0.1, regex-pcre-0.94.4.

I'll add that the documentation states:

    Using the provided CompOption and ExecOption values and if configUTF8 is True, then you might be able to send UTF8 encoded ByteStrings to PCRE and get sensible results. This is currently untested.

Is configUTF8 True on your system (on mine it is)?
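
A minimal sketch for checking this, assuming regex-pcre is installed and that configUTF8 and getVersion are imported from Text.Regex.PCRE.Wrap. It only prints what the binding reports about the linked PCRE library; it does not run any match.

```haskell
import Data.Maybe (fromMaybe)
import Text.Regex.PCRE.Wrap (configUTF8, getVersion)

main :: IO ()
main = do
  -- True if the linked libpcre reports that it was built with UTF-8 support.
  putStrLn ("configUTF8: " ++ show configUTF8)
  -- Version string of the linked libpcre, if the binding can obtain one.
  putStrLn ("pcre lib:   " ++ fromMaybe "unknown" getVersion)
```

Run it with runghc, the same way as the test case.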

> I have a test-case here if anyone would like to look at it and give me a "Works For Me" or "Your code's wrong" or something.

I got the same wrong output as you.

regex-base: 0.93.2
regex-pcre: 0.94.4
GHC: 7.10.3
LANG: en_GB.UTF-8
LANGUAGE: en_GB:en
system: debian 8.5 (jessie)
configUTF8: returns True
pcre lib: 8.35 2014-04-04

On Wed, 7 Sep 2016 21:21:43 -0400 Brian Sammon wrote:
> I tried to write a program using Text.Regex.PCRE to search through a UTF8-encoded document. It appears that the presence of non-breaking-space characters (code point 160) triggers some weird behavior in my program.

Not sure why I didn't find it with earlier Google searches, but today I found this rather interesting thread from a few years back on haskell-cafe: http://haskell-cafe.haskell.narkive.com/OU9UhI0y/

It describes a problem someone was having with GHC 7 and passing strings to Text.Regex.PCRE. There is also a suggested workaround and an explanation that seems to be a very good match for the off-by-one error I was seeing.

I can't tell (from that thread or elsewhere on Google) if/when/how this bug was fixed, but based on other responses here, it sounds like it was fixed by the time of GHC 8.
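
To illustrate how an off-by-one of this kind could arise, here is a sketch that uses only base and no regex library. It rests on an assumption that is not confirmed in this thread: that the match position comes back as a byte offset into the UTF-8 encoding, while the String is then sliced by character index. The input string and the names utf8Width, byteOffset, and slice are hypothetical, made up for the illustration.

```haskell
import Data.Char (ord)

-- Number of bytes one character occupies in UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Byte offset, in the UTF-8 encoding of s, of the character at index i.
byteOffset :: String -> Int -> Int
byteOffset s i = sum (map utf8Width (take i s))

-- Cut out a substring by (offset, length), as one would with a match position.
slice :: Int -> Int -> String -> String
slice off len = take len . drop off

main :: IO ()
main = do
  let doc     = "<title>\160Page Title'</title>" -- hypothetical input with one NBSP
      charOff = 8                                -- character index of 'P'
      len     = 10                               -- length of "Page Title"
      byteOff = byteOffset doc charOff           -- 9: the NBSP takes two bytes
  putStrLn (slice charOff len doc) -- prints: Page Title
  putStrLn (slice byteOff len doc) -- prints: age Title'
```

Each two-byte character before the match shifts the slice by one position, which would fit the "age Title'" output if the assumption holds.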
participants (5)
- Brian Sammon
- Francesco Ariis
- Joachim Durchholz
- MarLinn
- Peter Simons