
On Tue, Dec 18, 2012 at 02:28:26PM +0800, Magicloud Magiclouds wrote:
Attached is the test text file. I tested my regexp like this:
Prelude> :m + Text.Regex.PCRE
Prelude Text.Regex.PCRE> z <- readFile "test.html"
Prelude Text.Regex.PCRE> let (b, m ,a, ss) = z =~ ".*?
b ... n of the Triumvirate</td>\r\n
David Rapoza</td>\r\n \r\n <i>Return to Ravnica</i>\r\n </td>\r\n 10/31/2012</td>\r\n </tr><tr>\r\n <"
Prelude Text.Regex.PCRE> m
"a href=\"/magic/magazine/article.aspx?x=mtg/daily/activity/1088\">
Judging from the values of b and m, the match was oddly shifted forward by 1 char (the sub-matches in ss were even worse, off by 2 chars). Rematching against a and so on gave correct results; only the first match was broken. Tested with regex-posix (with a modified regexp), everything is OK.
I have a similar issue with non-ASCII strings. It seems that Haskell and PCRE use different internal representations: one of them counts bytes and the other counts code points, so the offsets diverge when a multi-byte encoding (such as UTF-8) is used.

This has been reported previously. See these threads:

http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#102959
http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#103029

I am still waiting for a new release of regex-pcre that fixes this issue.

Romildo
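
To illustrate the byte versus code point mismatch described above, here is a minimal sketch using only base. It is not the regex-pcre implementation; the helper names (utf8Width, codePointLength, utf8ByteLength, byteToCharOffset) are hypothetical and exist only for this example. It assumes the text is encoded as UTF-8 on the C side and that any byte offset falls on a character boundary.

```haskell
import Data.Char (ord)

-- Number of bytes one code point occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Length as Haskell sees it: one unit per code point.
codePointLength :: String -> Int
codePointLength = length

-- Length as a byte-oriented library would see the UTF-8 encoding.
utf8ByteLength :: String -> Int
utf8ByteLength = sum . map utf8Width

-- Translate a byte offset into the UTF-8 encoding back to a
-- code point index into the original String.
-- Assumption: the offset lies on a character boundary.
byteToCharOffset :: String -> Int -> Int
byteToCharOffset s target =
  min (length s) (length (takeWhile (< target) starts))
  where
    -- Byte offset at which each character starts, plus the total.
    starts = scanl (+) 0 (map utf8Width s)
```

For example, codePointLength "caf\233" is 4 while utf8ByteLength "caf\233" is 5, so a byte offset used directly as a String index lands one character too far after a single two-byte character. byteToCharOffset "\233<a" 3 gives 2, the index of 'a' in the String.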