regex and Unicode

28 Aug 2016

I tried to write a program using Text.Regex.PCRE to search through a UTF-8-encoded document.  It appears that the presence of non-breaking-space characters (code point 160, U+00A0) triggers some weird behavior in my program.

This is using the Debian stable (Jessie) packages of GHC 7.6.3 and its libraries.
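
For reference, code point 160 lies outside ASCII, so in UTF-8 it is stored as two bytes (0xC2 0xA0) rather than one. The sketch below only illustrates that encoding; it is not the reduced test case from the post, and whether this byte/character mismatch is what confuses the regex library is an assumption. The names utf8Bytes and nbsp are hypothetical, and the encoder is hand-rolled on top of base so that it does not depend on any particular library version.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Char as its UTF-8 byte sequence.
-- Hypothetical helper for illustration; surrogates are not treated specially.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [fromIntegral n]                              -- 1 byte (ASCII)
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                       -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]              -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]     -- 4 bytes
  where
    n      = ord c
    -- Leading-byte payload: the bits above position s.
    hi s   = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits starting at position s.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- The non-breaking space, code point 160 (decimal escape).
nbsp :: Char
nbsp = '\160'
```

Evaluating utf8Bytes nbsp in GHCi gives [194,160], so a matcher that works byte by byte sees two units where a character-aware one sees a single code point.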

Now I find myself at a fork in the road, not sure which direction to head in.

Do I:
1) Continue looking (or get help with looking) for bugs in my code?  (I
    have this reduced to a pretty small test case.)
2) Assemble a bug report against Debian?
3) Assemble a bug report against Text.Regex.PCRE (or Text.Regex.Base) for
    "upstream"?
4) Uninstall Text.Regex.PCRE (and/or some other packages) and switch to
    something that works with Unicode/UTF-8?

Any ideas?

Brian Sammon
