A weird bug of regex-pcre

The test text file is attached.
I tested my regexp like this:
Prelude> :m + Text.Regex.PCRE
Prelude Text.Regex.PCRE> z <- readFile "test.html"
Prelude Text.Regex.PCRE> let (b, m ,a, ss) = z =~ ".*? b
...
n of the Triumvirate</td>\r\n
From the values of b and m, the match appears to be shifted forward by one character (the submatches in ss are even worse, off by two characters). Re-matching against a, and so on, gave correct results; only the first match was broken. With regex-posix (and a suitably modified regexp), everything works.
$ ghc-pkg describe regex-pcre
name: regex-pcre
version: 0.94.4
id: regex-pcre-0.94.4-d45e00c9e113c7c9352d0785497e1dca
license: BSD3
copyright: Copyright (c) 2006, Christopher Kuklewicz
maintainer: TextRegexLazy@personal.mightyreason.com
stability: Seems to work, passes a few tests
homepage: http://hackage.haskell.org/package/regex-pcre
package-url: http://code.haskell.org/regex-pcre/
synopsis: Replaces/Enhances Text.Regex
description: The PCRE backend to accompany regex-base, see www.pcre.org
category: Text
author: Christopher Kuklewicz
exposed: True
exposed-modules: Text.Regex.PCRE Text.Regex.PCRE.Wrap
                 Text.Regex.PCRE.String Text.Regex.PCRE.Sequence
                 Text.Regex.PCRE.ByteString Text.Regex.PCRE.ByteString.Lazy
hidden-modules:
trusted: False
import-dirs: /home/magicloud/.cabal/lib/regex-pcre-0.94.4/ghc-7.6.1
library-dirs: /home/magicloud/.cabal/lib/regex-pcre-0.94.4/ghc-7.6.1
hs-libraries: HSregex-pcre-0.94.4
extra-libraries: pcre
extra-ghci-libraries:
include-dirs:
includes:
depends: array-0.4.0.1-cbe8814e07792e8f0d66cac77a2c0b6b
         base-4.6.0.0-9108e251636b0c8499261c52a7809ea1
         bytestring-0.10.0.1-11d4f52c4f4ed9833f768577b77050c5
         containers-0.5.2.1-b183418bc7f43ce98b6916ef296c2669
         regex-base-0.93.2-1ee07f806ad6b0c911226883d15b64f2
hugs-options:
cc-options:
ld-options:
framework-dirs:
frameworks:
haddock-interfaces: /home/magicloud/.cabal/share/doc/regex-pcre-0.94.4/html/regex-pcre.haddock
haddock-html: /home/magicloud/.cabal/share/doc/regex-pcre-0.94.4/html
pkgroot: "/home/magicloud/.ghc/x86_64-linux-7.6.1"

--
Dense bamboo cannot stop the flowing water; high mountains cannot block the drifting clouds.
And for G+, please use magiclouds#gmail.com.

On Tue, Dec 18, 2012 at 02:28:26PM +0800, Magicloud Magiclouds wrote:
> Attachment is the test text file. And I tested my regexp as this:
>
> Prelude> :m + Text.Regex.PCRE
> Prelude Text.Regex.PCRE> z <- readFile "test.html"
> Prelude Text.Regex.PCRE> let (b, m ,a, ss) = z =~ ".*?
> b ... n of the Triumvirate</td>\r\n
> David Rapoza</td>\r\n \r\n <i>Return to Ravnica</i>\r\n </td>\r\n 10/31/2012</td>\r\n </tr><tr>\r\n <"
> Prelude Text.Regex.PCRE> m
> "a href=\"/magic/magazine/article.aspx?x=mtg/daily/activity/1088\">
>
> From the value of b and m, it was weird that the matching was moved forward by 1 char ( the ss (sub matching) was even worse, 2 chars ). Rematch to a and so on gave correct results. It was only the first matching that was broken. Tested with regex-posix (with modified regexp), everything is OK.

I have a similar issue with non-ASCII strings. It seems that the internal representations used by Haskell and pcre are different: one of them counts bytes and the other counts code points. So the offsets diverge when a multi-byte encoding (like UTF-8) is used.

It has been reported previously. See these threads:

http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#102959
http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#103029

I am still waiting for a new release of regex-pcre that fixes this issue.

Romildo
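To make the byte versus code point mismatch concrete, here is a minimal sketch using only base. It does not call regex-pcre and is not a claim about how that library computes offsets; the helper names (utf8Width, byteLength, byteToCharOffset) are made up for this illustration. It only shows how an offset counted in UTF-8 bytes drifts away from an index into a Haskell String once a multi-byte character appears before it.

```haskell
import Data.Char (ord)

-- Number of bytes one code point occupies in UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Length of a String once encoded as UTF-8, in bytes.
byteLength :: String -> Int
byteLength = sum . map utf8Width

-- Convert an offset counted in UTF-8 bytes into an index counted in Chars.
-- The scanl produces the byte offset at which each Char starts; we count
-- how many Chars start before the given byte offset.
byteToCharOffset :: String -> Int -> Int
byteToCharOffset s byteOff =
  length (takeWhile (< byteOff) (init (scanl (+) 0 (map utf8Width s))))
```

For the string "é<a", length gives 3 while byteLength gives 4, and the "<" that sits at byte offset 2 is at Char index 1. Using a byte offset directly as a Char index would therefore split the string one position too late for every two-byte character that precedes the match.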
4:55 a.m.
I had similar issues a while ago. As far as I can recall, it had to do with UTF-8 encoding.

I wanted to "wrap" a multiline string (code listings) within some pandoc-generated HTML of a hakyll page with a container "div". The text to wrap would be determined using a PCRE regex.

Here is the (probably inefficient) implementation:

    module Transformations where

    import Hakyll
    import qualified Text.Regex.PCRE as RE
    import qualified Data.ByteString.UTF8 as BSU
    import qualified Data.ByteString as BS

    -- Wraps numbered code listings within the page body with a div
    -- in order to be able to apply some more specific styling.
    -- (The opening parts of the two string literals below are missing
    -- in the archived message.)
    wrapNumberedCodelistings (Page meta body) = Page meta newBody
      where
        newBody = regexReplace' regex wrap body
        regex = "…]+>.*?</table>"-
        wrap x = "…" ++ x ++ "</div>"

    -- Replace the whole string matched by the given
    -- regex using the given replacement function (hopefully UTF8-aware)
    regexReplace' :: String -> (String -> String) -> String -> String
    regexReplace' pattern replace text = BSU.toString $ go textUTF8
      where
        patternUTF8 = BSU.fromString pattern
        textUTF8 = BSU.fromString text
        replaceUTF8 x = BSU.fromString $ replace $ BSU.toString x
        regex = RE.makeRegexOpts compOpts RE.defaultExecOpt patternUTF8
        compOpts = RE.compMultiline + RE.compDotAll + RE.compUTF8 + RE.compNoUTF8Check
        -- Rewrite the first match, then recurse on the text after it.
        go part = case RE.matchM regex part of
          Just (before, match, after) -> BS.concat [before, replaceUTF8 match, go after]
          _ -> part

The discussion back then was http://www.haskell.org/pipermail/beginners/2012-June/010064.html

Hope this helps.

Best regards,
Rico Moorman

P.S. Sorry for the double email Magicloud ... didn't hit reply all at first
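The recursion used for the replacement in Rico's message (split the text into before, match and after; rewrite the match; recurse on the remainder) can be sketched independently of regex-pcre. In the sketch below, Matcher, literal and replaceAllWith are hypothetical names introduced only for illustration, and literal is a plain substring matcher standing in for a compiled regex.

```haskell
import Data.List (isPrefixOf)

-- A matcher splits its input into (before, match, after), or fails.
type Matcher = String -> Maybe (String, String, String)

-- Hypothetical stand-in for a regex: find the first occurrence of a literal.
literal :: String -> Matcher
literal needle = go []
  where
    go _ [] = Nothing
    go acc rest@(c:cs)
      | needle `isPrefixOf` rest =
          Just (reverse acc, needle, drop (length needle) rest)
      | otherwise = go (c : acc) cs

-- Rewrite every match with f, recursing on the text after each match.
-- Empty matches are treated as "no match" so the recursion always ends.
replaceAllWith :: Matcher -> (String -> String) -> String -> String
replaceAllWith match f = go
  where
    go part = case match part of
      Just (before, m, after)
        | not (null m) -> before ++ f m ++ go after
      _ -> part
```

For example, replaceAllWith (literal "ab") (\x -> "[" ++ x ++ "]") "xabyab" evaluates to "x[ab]y[ab]". The guard against empty matches matters with real regexes too: a pattern that can match the empty string would otherwise make this kind of loop recurse forever on the same remainder.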
9:11 a.m.
    regex = "…]+>.*?</table>"-

And mind the sneaky single "-" ... it does not belong there ;-)
19 Dec, 1:55 a.m.
I see. A known bug. Thank you all.
4 comments, 3 participants: José Romildo Malaquias, Magicloud Magiclouds, Rico Moorman