Re: [Haskell-beginners] remove XML tags using Text.Regex.Posix

Hi Robert,

On Tue, Sep 29, 2009 at 12:25:07PM -0700, Robert Ziemba wrote:
I have been working with the regular expression package (Text.Regex.Posix). My hope was to find a simple way to remove a pair of XML tags from a short string.
I have something like this "<tag>Data</tag>" and would like to extract 'Data'. There is only one tag pair, no nesting, and I know exactly what the tag is.
This is so simple that I would not recommend anything other than regular expressions. Use the following pattern:

pat = "<tag>(.*)</tag>"

It creates a group within the matched string containing the data (this is done using parentheses). Use `[[String]]` as the result type and you receive a list of matches, where each match is a list of strings: the first member is the whole matched string (including <tag> and </tag>), followed by the values of the groups (in our case there is just one group). Thus:

*Main> "text<tag>data</tag>text" =~ pat :: [[String]]
[["<tag>data</tag>","data"]]

It is easy to extract the data using `(!!)` and `head`:

*Main> (!! 1) . head $ ("text<tag>7</tag>text" =~ pat :: [[String]])
"7"
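Since `head` and `(!!)` fail with an exception when there is no match, a pattern match on the result may be a safer way to pull out the group. The following is only a sketch: the name `extractData` is made up for illustration, and it assumes the `regex-posix` package is installed.

```haskell
import Text.Regex.Posix ((=~))

-- The pattern from the message above: one group between the two tags.
pat :: String
pat = "<tag>(.*)</tag>"

-- Hypothetical helper (not part of Text.Regex.Posix):
-- return the first group of the first match, or Nothing if nothing matched.
extractData :: String -> Maybe String
extractData s =
  case (s =~ pat :: [[String]]) of
    -- first match = whole matched text followed by the group values
    ((_whole : group1 : _) : _) -> Just group1
    -- an empty list means the pattern did not match at all
    _                           -> Nothing
```

With this, `extractData "text<tag>7</tag>text"` gives `Just "7"`, and a string without the tag pair gives `Nothing` instead of an error.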
My first attempt was this:
"<tag>123</tag>" =~ "[^<tag>].+[^</tag>]"::String
result: "123"
The problem with your pattern is that `[^<tag>]` doesn't mean what you think it does. Its meaning is “one character which is not `<`, `t`, `a`, `g`, or `>`” as Patrick already described in his mail.
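A small experiment may make the bracket expression easier to see. This is a sketch only; `matchesBracket` and `firstAttempt` are names invented here, and the `regex-posix` package is assumed.

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: does this single character match "[^<tag>]"?
-- The bracket expression stands for ONE character that is none of < t a g >.
matchesBracket :: Char -> Bool
matchesBracket c = [c] =~ "[^<tag>]"

-- The pattern from the first attempt, returning the matched text ("" if none).
-- It has three pieces, and each piece consumes at least one character:
--   [^<tag>]   one character outside  < t a g >
--   .+         one or more arbitrary characters
--   [^</tag>]  one character outside  < / t a g >
firstAttempt :: String -> String
firstAttempt s = s =~ "[^<tag>].+[^</tag>]"
```

Here `matchesBracket` is `False` for each of the characters in `"<tag>"` and `True` for a digit such as `'1'`. `firstAttempt "<tag>123</tag>"` yields `"123"` because `1`, `2` and `3` each satisfy one of the three pieces.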
Upon further experimenting I realized that it only works with more than 2 digits in 'Data'. It occurred to me that my thinking on how this regular expression works was not correct - but I don't understand why it works at all for 3 or more digits.
It doesn't work for all data of 3 or more characters:

*Main> "<tag>tag</tag>" =~ "[^<tag>].+[^</tag>]" :: String
""

Briefly, it breaks when the data starts or ends with one of the characters `<`, `t`, `a`, `g`, `>`.

Finally, consider using

pat = "<tag>([^<]*)</tag>"

which works with more tags in the same line as well.

Sincerely,
jan.
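To illustrate the "more tags in the same line" case, the `[[String]]` result can be mapped over to collect the group of every match. This is a sketch under the same assumptions as before (`regex-posix` installed); `allData` is a hypothetical name.

```haskell
import Text.Regex.Posix ((=~))

-- [^<]* cannot run past the next '<', so each match stops at its own </tag>.
pat :: String
pat = "<tag>([^<]*)</tag>"

-- Hypothetical helper: the captured data of every <tag>...</tag> pair in the line.
allData :: String -> [String]
allData s = [ g | (_whole : g : _) <- (s =~ pat :: [[String]]) ]
```

For example, `allData "<tag>foo</tag>bar<tag>baz</tag>"` gives `["foo","baz"]`, and a line without the tag pair gives `[]`.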

On Wed, Sep 30, 2009 at 11:11 AM, Jan Jakubuv wrote:
This is so simple that I would not recommend anything other than regular expressions. Use the following pattern:
pat = "<tag>(.*)</tag>"
Don't use this; the * operator is greedy by default, meaning that it will match stuff like "<tag>foo</tag>bar<tag>baz</tag>", and your data will end up being "foo</tag>bar<tag>baz". In other words, a greedy operator tries to consume as much of the string as it possibly can while still matching. If that regex module supports non-greedy operators, you want something like this:

pat = "<tag>(.*?)</tag>"

A "?" after a greedy operator makes it non-greedy, meaning it will try to match while consuming as little of the string as it can. If the posix regex module doesn't support this, the PCRE-based one should.
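The over-matching described above can be reproduced with a short sketch. It stays with Text.Regex.Posix (assuming `regex-posix` is installed) and deliberately does not try `.*?`; `greedyCapture` is a name made up for this example.

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: the groups captured by the greedy pattern.
-- The 4-tuple result type is (text before, match, text after, groups).
greedyCapture :: String -> [String]
greedyCapture s = groups
  where
    (_before, _match, _after, groups) =
      (s =~ "<tag>(.*)</tag>" :: (String, String, String, [String]))
```

`greedyCapture "<tag>foo</tag>bar<tag>baz</tag>"` returns `["foo</tag>bar<tag>baz"]`: the single group runs from the first `<tag>` all the way to the last `</tag>`.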