regex and Unicode - Haskell-Cafe - Haskell.org

regex and Unicode

Brian Sammon

8 Sep 2016

1:21 a.m.

I tried to write a program using Text.Regex.PCRE to search through a UTF8-encoded document. It appears that the presence of non-breaking-space characters (code point 160) triggers some weird behavior in my program.

This is using the Debian stable (Jessie) packages of ghc 7.6.3 and libraries.

Now I find myself at a fork in the road, not sure which direction to head in.

Do I:
1) Continue looking (or get help with looking) for bugs in my code? (I have this reduced to a pretty small test case)
2) Assemble a bug report against Debian?
3) Assemble a bug report against Text.Regex.PCRE (or Text.Regex.Base) for "upstream"?
4) Uninstall Text.Regex.PCRE (and/or some other packages) and switch to something that works with Unicode/UTF8?

Any ideas?

Francesco Ariis

8 Sep

2:46 a.m.

On Wed, Sep 07, 2016 at 09:21:43PM -0400, Brian Sammon wrote:

I tried to write a program using Text.Regex.PCRE to search through a UTF8-encoded document. It appears that the presence of non-breaking-space characters (code point 160) triggers some weird behavior in my program.

This is using the Debian stable(Jessie) packages of ghc 7.6.3 and libraries.

Now I find myself at a fork in the road, not sure which direction to head in.

Do I:
1) Continue looking (or get help with looking) for bugs in my code? (I have this reduced to a pretty small test case)
2) Assemble a bug-report against debian?
3) Assemble a bug-report against Text.Regex.PCRE (or Text.Regex.Base) for "upstream"
4) Uninstall Text.Regex.PCRE (and/or some other packages) and switch to something that works with Unicode/UTF8?

I am pretty sure pcre-light has a UTF-8 mode. Is it feasible to swap the two modules and check whether the bug persists?

Peter Simons

6:42 a.m.

Hi Brian,

I tried to write a program using Text.Regex.PCRE to search through a UTF8-encoded document. It appears that the presence of non-breaking-space characters (code point 160) triggers some weird behavior in my program.

I seem to recall that regex-pcre simply binds to the system's pcre library and effectively lets that library do all the work. Now, libpcre has full Unicode support, but that needs to be enabled at compile time to be available. I believe "--enable-unicode-properties" is the appropriate configure flag, but I don't know for sure.

Anyway, my point is that your system's libpcre may or may not have that feature enabled. If it does not, then regex-pcre won't be able to deal with Unicode characters properly and that issue should be reported to Debian. If your system library *has* Unicode support, then this issue might be caused by a bug in regex-pcre (unlikely) or in your code that uses it (more likely).

I hope this helps,
Peter

Joachim Durchholz

8:30 a.m.

On 08.09.2016 at 08:42, Peter Simons wrote:

Anyway, my point is that your system's libpcre may or may not have that feature enabled. If it does not, then regex-pcre won't be able to deal with Unicode characters properly and that issue should be reported to Debian.

I'd be surprised if that's what's happening; Debian has been all-Unicode for quite a while now, and PCRE libraries are high-profile enough that lack of Unicode support would have been reported long ago.

I dimly recall that there are multiple PCRE libraries in Debian, with subtle differences. I could imagine that one of them operates in 8-bit mode by default and wants a flag somewhere to activate Unicode processing.

Brian Sammon

8:53 a.m.

I have a test-case here if anyone would like to look at it and give me a "Works For Me" or "Your code's wrong" or something. The program should (i.e. I expect it to) report that it found "Page Title", but in my case, it says it found "age Title'".

Francesco Ariis

8:52 a.m.

On Thu, Sep 08, 2016 at 04:53:13AM -0400, Brian Sammon wrote:

I have a test-case here if anyone would like to look at it and give me a "Works For Me" or "Your code's wrong" or something.

The program should (i.e. I expect it to) report that it found "Page Title", but in my case, it says it found "age Title'".

$ runghc test.hs
title is |Page Title|

Debian stable (Jessie). Maybe locale is interfering with something? Mine is

LANG=en_GB.UTF-8
LANGUAGE=en_GB:en
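To compare locale effects between machines, it may help to print the encodings the GHC runtime has picked up. The following is a minimal sketch using only `base`. It assumes, without this thread confirming it, that the relevant setting is the one GHC uses when marshalling a `String` to a C string (the "foreign" encoding, which follows the locale by default).

```haskell
import GHC.IO.Encoding (getForeignEncoding, getLocaleEncoding)
import System.IO (hGetEncoding, stdout)

main :: IO ()
main = do
  -- Encoding derived from LANG / LC_ALL at program start.
  locale  <- getLocaleEncoding
  -- Encoding used when Strings are marshalled to C strings.
  foreign' <- getForeignEncoding
  -- Encoding attached to stdout (Nothing means binary mode).
  out     <- hGetEncoding stdout
  putStrLn ("locale encoding:  " ++ show locale)
  putStrLn ("foreign encoding: " ++ show foreign')
  putStrLn ("stdout encoding:  " ++ show out)
```

Running it with `runghc` on both machines and comparing the output would show whether the two setups really differ in locale handling.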

Brian Sammon

9:14 a.m.

On Thu, 8 Sep 2016 10:52:46 +0200 Francesco Ariis wrote:

$ runghc test.hs
title is |Page Title|

Debian stable (Jessie). Maybe locale is interfering with something?

Hmm... are you running Debian-packaged ghc 7.6.3, as I am?

Francesco Ariis

9:36 a.m.

On Thu, Sep 08, 2016 at 05:14:11AM -0400, Brian Sammon wrote:

On Thu, 8 Sep 2016 10:52:46 +0200 Francesco Ariis wrote:

...
$ runghc test.hs
title is |Page Title|

Debian stable (Jessie). Maybe locale is interfering with something?

Hmm... are you running debian-packaged ghc 7.6.3 , as I am?

No, GHC 8.0.1, regex-pcre-0.94.4.

Francesco Ariis

9:45 a.m.

On Thu, Sep 08, 2016 at 11:36:58AM +0200, Francesco Ariis wrote:

No, GHC 8.0.1, regex-pcre-0.94.4.

I'll add that the documentation states:

"Using the provided CompOption and ExecOption values and if configUTF8 is True, then you might be able to send UTF8 encoded ByteStrings to PCRE and get sensible results. This is currently untested."

Is configUTF8 True on your system (on mine it is)?

MarLinn

10:15 a.m.

I have a test-case here if anyone would like to look at it and give me a "Works For Me" or "Your code's wrong" or something.

I got the same wrong output as you.

regex-base: 0.93.2
regex-pcre: 0.94.4
GHC: 7.10.3
LANG: en_GB.UTF-8
LANGUAGE: en_GB:en
system: debian 8.5 (jessie)
configUTF8: returns True
pcre lib: 8.35 2014-04-04

Brian Sammon

7:48 p.m.

On Wed, 7 Sep 2016 21:21:43 -0400 Brian Sammon wrote:

I tried to write a program using Text.Regex.PCRE to search through a UTF8-encoded document. It appears that the presence of non-breaking-space characters (code point 160) triggers some weird behavior in my program.

Not sure why I didn't find it with earlier Google searches, but today I found this rather interesting thread from a few years back on haskell-cafe: http://haskell-cafe.haskell.narkive.com/OU9UhI0y/

It describes a problem someone was having with GHC 7 and passing strings to Text.Regex.PCRE. There is also a suggested workaround and an explanation that seems to be a very good match for the off-by-one error I was seeing.

I can't tell (from that thread or elsewhere on Google) if, when, or how this bug was fixed, but based on other responses here, it sounds like it was fixed by the time of GHC 8.
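For readers who want to see how such an off-by-one can arise, here is a minimal sketch that uses only `base` and does not call regex-pcre at all. It assumes, as one plausible reading of the linked explanation and not something confirmed in this thread, that the C library reports match offsets in UTF-8 bytes while the Haskell side applies them as character positions in a `String`. The input string and all function names are hypothetical; the actual test case is not shown in this thread.

```haskell
import Data.Char (ord)
import Data.List (isPrefixOf)

-- Number of bytes one character occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Character index of the first occurrence of needle in haystack.
charIndexOf :: String -> String -> Maybe Int
charIndexOf needle = go 0
  where
    go i s
      | needle `isPrefixOf` s = Just i
    go _ []       = Nothing
    go i (_:rest) = go (i + 1) rest

-- Offset of the same occurrence counted in UTF-8 bytes, which is how a
-- byte-oriented C library would report it.
byteOffsetOf :: String -> String -> Maybe Int
byteOffsetOf needle haystack =
  fmap (\i -> sum (map utf8Width (take i haystack)))
       (charIndexOf needle haystack)

-- Slice by character positions, as a String-based wrapper would.
sliceChars :: Int -> Int -> String -> String
sliceChars off len = take len . drop off

main :: IO ()
main = do
  -- Hypothetical input: one NBSP (code point 160), then a quoted title.
  let doc    = "\160'Page Title'"
      needle = "Page Title"
      len    = length needle  -- ASCII only, so bytes == characters here
  case (charIndexOf needle doc, byteOffsetOf needle doc) of
    (Just ci, Just bi) -> do
      putStrLn ("char offset " ++ show ci ++ ": " ++ sliceChars ci len doc)
      putStrLn ("byte offset " ++ show bi ++ ": " ++ sliceChars bi len doc)
    _ -> putStrLn "no match"
```

With this input the character offset is 2 and yields `Page Title`, while the byte offset is 3 (the NBSP takes two bytes in UTF-8) and yields `age Title'`. Each non-ASCII character before the match shifts the result further.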
