remove XML tags using Text.Regex.Posix

I have been working with the regular expression package (Text.Regex.Posix). My hope was to find a simple way to remove a pair of XML tags from a short string.
I have something like this "<tag>Data</tag>" and would like to extract 'Data'. There is only one tag pair, no nesting, and I know exactly what the tag is.
My first attempt was this:
    "<tag>123</tag>" =~ "[^<tag>].+[^</tag>]" :: String
    result: "123"
Upon further experimenting I realized that it only works with more than 2 digits in 'Data'. It occurred to me that my thinking on how this regular expression works was not correct - but I don't understand why it works at all for 3 or more digits.
Can anyone help me understand this result and perhaps suggest another strategy? Thank you.

Robert,
On Tue, Sep 29, 2009 at 3:25 PM, Robert Ziemba wrote:
I have been working with the regular expression package (Text.Regex.Posix). My hope was to find a simple way to remove a pair of XML tags from a short string.
I have something like this "<tag>Data</tag>" and would like to extract 'Data'. There is only one tag pair, no nesting, and I know exactly what the tag is.
My first attempt was this:
"<tag>123</tag>" =~ "[^<tag>].+[^</tag>]"::String
result: "123"
Upon further experimenting I realized that it only works with more than 2 digits in 'Data'. It occurred to me that my thinking on how this regular expression works was not correct - but I don't understand why it works at all for 3 or more digits.
Can anyone help me understand this result and perhaps suggest another strategy? Thank you.
The regex you are using here can be described as follows: "Match a character not in the set '<,t,a,g,>', followed by 1 or more of anything, followed by a character not in the set '<,/,t,a,g,>'."
Effectively, it will not match if your data has fewer than 3 characters, and it is probably not the correct regex for this job, i.e. it would also match "x123x".
What you need is regex capturing, but I don't know if that is available in that regex library (I'm not an expert Haskeller). If you really need a regex to locate the tag, you could use a function like this to extract it:
    import Text.Regex.Posix ((=~))
    -- Returns the data between <tag> and </tag>.
    -- Assumes the match starts at the beginning of s.
    getTagData :: String -> String -> Maybe String
    getTagData tag s =
      let match   = s =~ ("<" ++ tag ++ ">.*</" ++ tag ++ ">") :: String
          -- skip the opening tag: '<' ++ tag ++ '>'
          dropTag = drop (length tag + 2) s
          -- keep what is left once both tags (2 * length tag + 5 characters) are discounted
          getData = take (length match - (2 * length tag + 5)) dropTag
      in if length match > 0 then Just getData else Nothing
    *Main> getTagData "tag" "<tag>123</tag>"
    Just "123"
Patrick
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners
-- ===================== Patrick LeBoutillier Rosemère, Québec, Canada
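On the question of whether capturing is available: Text.Regex.Posix can return submatches when the result of =~ is requested as a 4-tuple (text before, matched text, text after, list of groups). The following is only a minimal sketch under that assumption. The names firstAttempt and extractTag are made up for illustration, and the tag name is assumed to contain no regex metacharacters.

```haskell
import Text.Regex.Posix ((=~))

-- The pattern from the original question. Each [...] is a bracket
-- expression: a set of single characters, not a literal tag.
firstAttempt :: String -> String
firstAttempt s = s =~ "[^<tag>].+[^</tag>]"

-- Hypothetical helper: capture the text between <tag> and </tag>.
-- The 4-tuple result is (before, matched, after, submatches).
-- POSIX regexes have no non-greedy operator, so [^<]* is used to
-- stop at the first '<' instead of ".*".
extractTag :: String -> String -> Maybe String
extractTag tag s =
  case s =~ pat :: (String, String, String, [String]) of
    (_, _, _, [inner]) -> Just inner
    _                  -> Nothing
  where
    pat = "<" ++ tag ++ ">([^<]*)</" ++ tag ++ ">"
```

With this, extractTag "tag" "<tag>12</tag>" should give Just "12", while firstAttempt on the same input gives an empty string because the pattern needs at least three characters that fall outside the two sets.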

Personally I would have used tagsoup for this sort of thing. Keep in mind the eternal words
Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems. -- Jamie Zawinski
As you so nicely demonstrated yourself ;-)
/M
-- Magnus Therning (OpenPGP: 0xAB4DFBA4) magnus@therning.org Jabber: magnus@therning.org http://therning.org/magnus identi.ca|twitter: magthe

HXT should be able to do what you're after quite easily from what I've seen.

On Wed, Sep 30, 2009 at 6:58 AM, Magnus Therning wrote:
Personally I would have used tagsoup for this sort of thing. Keep in mind the eternal words
Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems. -- Jamie Zawinski
As you so nicely demonstrated yourself ;-)
Here's a quick and dirty solution using tagsoup:
    % cat file.xml
    <tag>123</tag>
    <tag>456</tag>
    <tag>789</tag>
    Text.HTML.Download Text.HTML.TagSoup> tags <- openItem "file.xml"
    Text.HTML.Download Text.HTML.TagSoup> map (fromTagText . head . tail) $ partitions (TagOpen "tag" [] ~==) (parseTags tags)
    ["123","456","789"]
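The same idea can be written as a pure function that works on a String already in memory, which avoids the Text.HTML.Download module. This is a rough sketch, assuming the tagsoup package is installed and that tags are not nested (as stated in the original question). tagContents is a hypothetical helper name, not something tagsoup provides.

```haskell
import Text.HTML.TagSoup
  (innerText, isTagCloseName, isTagOpenName, parseTags, partitions)

-- Hypothetical helper: text content of every <name>...</name> pair.
-- partitions splits the tag stream into chunks that each start at an
-- opening tag with the given name; the opening tag is dropped, tags
-- are kept up to the matching close, and innerText joins their text.
tagContents :: String -> String -> [String]
tagContents name input =
  [ innerText (takeWhile (not . isTagCloseName name) body)
  | _open : body <- partitions (isTagOpenName name) (parseTags input)
  ]
```

Unlike the head . tail version, this does not fail on an empty element such as "<tag></tag>"; it returns an empty string for it.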

This is how I did it using the HXT library:
    Prelude Text.XML.HXT.Parser.XmlParsec Text.XML.HXT.Arrow.XmlIOStateArrow Text.XML.HXT.Arrow> runX (readString [] "<tag>123</tag>" >>> getXPathTrees "tag" >>> getChildren >>> getText)
    ["123"]
Everything after "Prelude" up to the first ">" is what you have to import to make this work.
- "readString" converts the input string into an internal representation of an XML tree,
- "getXPathTrees" selects all the <tag> elements,
- "getChildren" narrows it down to the data between <tag> and </tag>,
- "getText" extracts all the data between those tags,
- "runX" fires up the whole process and returns the results as a list in the IO monad.
hth,
deech
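The pipeline described above can also be sketched as a small function. This version assumes a more recent HXT where everything is imported from Text.XML.HXT.Core, and it swaps getXPathTrees for deep (hasName ...), since as far as I know the XPath functions now live in a separate hxt-xpath package. tagTexts is a hypothetical helper name.

```haskell
import Text.XML.HXT.Core

-- Hypothetical helper: text children of every element called `name`.
-- withValidate no switches off DTD validation, since a bare fragment
-- like "<tag>123</tag>" has no DTD to validate against.
tagTexts :: String -> String -> IO [String]
tagTexts name input =
  runX $
    readString [withValidate no] input
      >>> deep (hasName name)   -- find the elements with this name
      >>> getChildren           -- step down to their child nodes
      >>> getText               -- keep only the text nodes
```

In ghci, tagTexts "tag" "<tag>123</tag>" should return ["123"], matching the session above.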

I think regexes are a pain and would suggest the xml-light package for your purpose, which is the smallest XML library. (Or use take, drop, isPrefixOf and isSuffixOf to chop off your tags manually.)
http://hackage.haskell.org/package/xml
Cheers Christian
    Prelude Text.XML.Light> concatMap strContent . onlyElems $ parseXML "<tag>123</tag>"
    "123"
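The manual route mentioned in parentheses needs nothing outside base. Below is a small sketch of it, assuming the string starts with the opening tag and ends with the closing tag. It uses stripPrefix from Data.List in place of an isPrefixOf check followed by drop, and stripTagPair is a hypothetical helper name.

```haskell
import Data.List (isSuffixOf, stripPrefix)

-- Hypothetical helper: remove a leading <tag> and a trailing </tag>.
-- Returns Nothing when either tag is missing.
stripTagPair :: String -> String -> Maybe String
stripTagPair tag s = do
  rest <- stripPrefix open s              -- drop the opening tag
  if close `isSuffixOf` rest
    then Just (take (length rest - length close) rest)
    else Nothing
  where
    open  = "<" ++ tag ++ ">"
    close = "</" ++ tag ++ ">"
```

For example, stripTagPair "tag" "<tag>123</tag>" gives Just "123", and it works for data of any length, including the one- and two-character cases that the original regex missed.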
participants (7)
- aditya siram
- Christian Maeder
- Colin Paul Adams
- Lyndon Maydwell
- Magnus Therning
- Patrick LeBoutillier
- Robert Ziemba