-- Replace the whole string matched by the given

-- regex using the given replacement function (hopefully UTF8-aware)

regexReplace' :: String -> (String -> String) -> String -> String

regexReplace' pattern replace text = BSU.toString $ go textUTF8

where

patternUTF8 = BSU.fromString pattern

textUTF8 = BSU.fromString text

replaceUTF8 x = BSU.fromString $ replace $ BSU.toString x

regex = RE.makeRegexOpts compOpts RE.defaultExecOpt $ BSU.fromString pattern

compOpts = RE.compMultiline + RE.compDotAll + RE.compUTF8 + RE.compNoUTF8Check

go part = case RE.matchM regex part of

Just (before, match, after) ->

BS.concat [before, replaceUTF8 match, go after]

_ -> part

The discussion back then was http://www.haskell.org/pipermail/beginners/2012-June/010064.html

Hope this helps.

Best regards,

Rico Moorman

P.S. Sorry for the double email Magicloud ... didn't hit reply all at first

On Tue, Dec 18, 2012 at 10:43 AM, José Romildo Malaquias <j.romildo@gmail.com> wrote:

On Tue, Dec 18, 2012 at 02:28:26PM +0800, Magicloud Magiclouds wrote:
> Attachment is the test text file.
> And I tested my regexp as this:
>
> Prelude> :m + Text.Regex.PCRE
> Prelude Text.Regex.PCRE> z <- readFile "test.html"
> Prelude Text.Regex.PCRE> let (b, m ,a, ss) = z =~ "<a
> href=\"(.*?)\">.*?<img class=\"article-image\"" :: (String, String, String,
> [String])
> Prelude Text.Regex.PCRE> b
> ...
> n of the Triumvirate</td>\r\n <td class=\"small\">David Rapoza</td>\r\n
> <td class=\"small\">\r\n <i>Return to Ravnica</i>\r\n </td>\r\n
> <td class=\"small\">10/31/2012</td>\r\n </tr><tr>\r\n <td
> class=\"small\"><"
> Prelude Text.Regex.PCRE> m
> "a href=\"/magic/magazine/article.aspx?x=mtg/daily/activity/1088\"><img
> class=\"article-image\" "
>
> >From the value of b and m, it was weird that the matching was moved forward
> by 1 char ( the ss (sub matching) was even worse, 2 chars ). Rematch to a
> and so on gave correct results. It was only the first matching that was
> broken.
> Tested with regex-posix (with modified regexp), everything is OK.

I have a similar issue with non-ascii strings. It seems that the
internal representation used by Haskell and pcre are different and one
of them is counting bytes and the other is counting code points. So they
diverge when a multi-byte representation (like utf8) is used.

It has been reported previously. See these threads:

http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#102959
http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#103029

I am still waiting for a new release of regex-pcre that fixes this
issue.

Romildo

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe