Re: [Haskell-cafe] Regular Expression with PCRE

There are a lot of reasons why I don't recommend that approach, but I think it's best explained by the following now-classic Stack Overflow post and answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtm...

Basically this applies in your case because recognizing if a sequence of characters is in a comment block or not for HTML is likely not expressible using regexes. There may be a way for a very controlled restricted subset of HTML, but it might require some complex regexes.

That said, if you're OK with some false positives and with dealing with them, a simple regex-based solution is the way to go!

Cheers,
-- Carter Tazio Schonwald

On Friday, March 16, 2012 at 7:08 PM, Joseph Bozeman wrote:
My goal is to remove the HTML comments. It probably would be at least as efficient to use an HTML parser, but I usually strip files by hand, and I always use regex then. I didn't want to bother importing yet another package, because if I could just get this line to work, I could get all my stripping done with three functions, and then I have four that I use to apply a template to the text once it's bare.
On Fri, Mar 16, 2012 at 5:41 PM, Carter Tazio Schonwald wrote:
Have you considered using one of the many amazing HTML parsers on Hackage?
If the goal is just to get the HTML comments, that might be a much more effective use of your time.
-- Carter Tazio Schonwald
On Friday, March 16, 2012 at 4:55 PM, Joseph Bozeman wrote:
Hey everyone, I'm hoping someone can point me in the right direction.
The regex-pcre package exports (=~) and (=~~) as two useful infix functions. They're great! The only problem is, they are a positive match for a regex. I have a file that contains HTML comments (it was generated in Word) and I really just want the barest text. I already have a function that strips out all the tags, and I have a function that finds all the links and sticks those in another file for later perusal.
What I'd like is advice on how to implement the (!~) and (!~~) operators. They should have the same types as (=~) and (=~~). I'm stuck, though. Here's the source for both of those functions; they depend on the Text.Regex.PCRE package.
(=~) :: (RegexMaker Regex CompOption ExecOption source, RegexContext Regex source1 target) => source1 -> source -> target
(=~) x r = let q :: Regex
               q = makeRegex r
           in match q x

(=~~) :: (RegexMaker Regex CompOption ExecOption source, RegexContext Regex source1 target, Monad m) => source1 -> source -> m target
(=~~) x r = do (q :: Regex) <- makeRegexM r
               matchM q x

What I figured I could do was find a function that was the inverse of "match" and "matchM", but I can't find any in the docs. I really hope I don't have to implement that, too. I'm still new at this, and that seems like it would be over my head.
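
For reference, here is a minimal sketch of what negated operators could look like. It assumes the regex-pcre package is installed and deliberately narrows the types: "does not match" only has an obvious meaning for a Bool result, so the fully polymorphic target type of (=~) is not carried over. The names (!~) and (!~~), the String-only signatures, the fixity, and the choice of Maybe for the monadic variant are all assumptions made for illustration, not part of the library.

```haskell
import Text.Regex.PCRE ((=~))

-- Hypothetical fixity; chosen to sit alongside comparison operators.
infix 4 !~, !~~

-- | True when the pattern does NOT match anywhere in the input.
-- Specialised to String for both input and pattern (an assumption).
(!~) :: String -> String -> Bool
x !~ r = not (x =~ r)

-- | One possible reading of a negated (=~~), specialised to Maybe:
-- returns the input unchanged when the pattern does not match,
-- and Nothing when it does.
(!~~) :: String -> String -> Maybe String
x !~~ r
  | x =~ r    = Nothing
  | otherwise = Just x

main :: IO ()
main = mapM_ putStrLn (filter (!~ "<!--") ["plain text", "<!-- comment -->", "more text"])
```

Used with filter as in main, this keeps only the lines in which the pattern does not occur, so it drops whole lines rather than removing a comment from the middle of one.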
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

On Fri, Mar 16, 2012 at 20:17, Carter Tazio Schonwald <carter.schonwald@gmail.com> wrote:
Basically this applies in your case because recognizing if a sequence of characters is in a comment block or not for HTML is likely not expressible using regexes.
There may be a way for a very controlled restricted subset of HTML, but it might require some complex regexes.
Comments in particular are one of the places where SGML said one thing, the HTML spec (which was loosely derived from SGML) said a different thing, and most browsers did something not quite either, occasionally in mutually incompatible ways. The result is that comments can be *very* difficult to get right in the general case. HTML is not at all easy to deal with.
-- 
brandon s allbery allbery.b@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms
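
For the restricted case discussed in this thread, comment stripping can also be sketched without any regex library. The version below uses only base and treats a comment as everything from "<!--" up to the next "-->". That definition is an assumption; it ignores the SGML and browser quirks mentioned above, so it is a rough tool for controlled input, not a correct HTML comment parser. The names stripComments and skipComment are made up for this example.

```haskell
import Data.List (isPrefixOf)

-- | Remove every "<!-- ... -->" span from the input.
-- An unterminated comment swallows the rest of the input.
stripComments :: String -> String
stripComments [] = []
stripComments s@(c:cs)
  | "<!--" `isPrefixOf` s = stripComments (skipComment (drop 4 s))
  | otherwise             = c : stripComments cs

-- | Drop characters up to and including the first "-->".
skipComment :: String -> String
skipComment [] = []
skipComment s@(_:cs)
  | "-->" `isPrefixOf` s = drop 3 s
  | otherwise            = skipComment cs

main :: IO ()
main = putStrLn (stripComments "<p>Hello<!-- Word junk --> world</p>")
```

Because the scan is a plain left-to-right pass, comments that span several lines are handled the same way as single-line ones. Text that merely looks like a comment inside a script or attribute value would be removed too, which is the kind of false positive mentioned earlier in the thread.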