
There are a lot of reasons why I don't recommend that approach, but I think it's best explained by the following now-classic Stack Overflow question and answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtm...
Basically, this applies in your case because deciding whether a sequence of characters is inside an HTML comment block is likely not expressible using regexes. There may be a way for a very controlled, restricted subset of HTML, but it might require some complex regexes.
That said, if you're OK with some false positives and with dealing with them, a simple regex-based solution is the way to go!
Cheers,
-- Carter Tazio Schonwald
On Friday, March 16, 2012 at 7:08 PM, Joseph Bozeman wrote:
My goal is to remove the HTML comments. It probably would be at least as efficient to use an HTML parser, but I usually strip files by hand, and I always use regex then. I didn't want to bother importing yet another package, because if I could just get this line to work, I could get all my stripping done with three functions, and then I have four that I use to apply a template to the text once it's bare.
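For the comment-removal goal described above, here is a minimal sketch that needs no extra package beyond base. Note that it swaps in a hand-written scan rather than a regex, and the names stripComments and afterClose are hypothetical, not from any library. It assumes comments are plain "<!-- ... -->" spans and knows nothing about scripts, CDATA, or other places where that text could legitimately appear, so false positives are possible.

```haskell
import Data.List (isPrefixOf)

-- | Remove every "<!-- ... -->" span from the input.
-- Assumption: an unterminated comment swallows the rest of the input.
stripComments :: String -> String
stripComments [] = []
stripComments s@(c:cs)
  | "<!--" `isPrefixOf` s = stripComments (afterClose (drop 4 s))
  | otherwise             = c : stripComments cs

-- | Drop characters up to and including the first "-->".
afterClose :: String -> String
afterClose [] = []
afterClose t@(_:rest)
  | "-->" `isPrefixOf` t = drop 3 t
  | otherwise            = afterClose rest
```

Something like `stripComments <$> readFile path` could then sit in front of the existing tag-stripping and link-collecting functions.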
On Fri, Mar 16, 2012 at 5:41 PM, Carter Tazio Schonwald wrote:
Have you considered using one of the many amazing HTML parsers on Hackage?
If the goal is to just get the HTML comments, that might be a much more effective use of your time
-- Carter Tazio Schonwald
On Friday, March 16, 2012 at 4:55 PM, Joseph Bozeman wrote:
Hey everyone, I'm hoping someone can point me in the right direction.
The regex-pcre package exports (=~) and (=~~) as two useful infix functions. They're great! The only problem is that they only report positive matches for a regex. I have a file that contains HTML comments (it was generated in Word) and I really just want the barest text. I already have a function that strips out all the tags, and I have a function that finds all the links and sticks those in another file for later perusal.
What I'd like is advice on how to implement the (!~) and (!~~) operators. They should have the same types as (=~) and (=~~). I'm stuck, though. Here's the source for both of those functions; they depend on the Text.Regex.PCRE package.
    (=~) :: ( RegexMaker Regex CompOption ExecOption source
            , RegexContext Regex source1 target )
         => source1 -> source -> target
    (=~) x r = let q :: Regex
                   q = makeRegex r
               in match q x

    (=~~) :: ( RegexMaker Regex CompOption ExecOption source
             , RegexContext Regex source1 target
             , Monad m )
          => source1 -> source -> m target
    (=~~) x r = do (q :: Regex) <- makeRegexM r
                   matchM q x

What I figured I could do was find a function that was the inverse of "match" and "matchM", but I can't find any in the docs. I really hope I don't have to implement that, too. I'm still new at this, and that seems like it would be over my head.
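One possible way to get a negated match without an inverse of "match" is to negate the Bool result of (=~). The sketch below assumes the regex-pcre package is installed. The operator (!~) and the helper linesWithout are hypothetical names, not part of the library. Because the result type of (=~) is polymorphic (Bool, the matched text, tuples, and so on), a negated operator cannot sensibly keep exactly the same type, so this version is specialised to String arguments and a Bool result.

```haskell
import Text.Regex.PCRE ((=~))

infix 4 !~

-- | True when the pattern does NOT match anywhere in the subject.
-- Hypothetical operator; it only negates the Bool instance of (=~).
(!~) :: String -> String -> Bool
subject !~ pat = not (subject =~ pat)

-- | Keep only the lines that the pattern does not match.
linesWithout :: String -> [String] -> [String]
linesWithout pat = filter (!~ pat)
```

For example, `linesWithout "<!--" (lines contents)` would drop every line that contains the start of a comment. A monadic (!~~) could be built the same way, by returning or failing depending on this Bool.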
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe