
On 11/12/10 6:56 PM, Michael Litchard wrote:
> I've been working on a project that requires me to do screen scraping. When I first started this, I worked off of other people's examples. Not one used regex. By luck I found someone at work to help me along this project. His clues and hints don't use regex either. I was at a point where I had to make a decision concerning design, so I asked the guy sitting next to me at work. He's very experienced, and comes from a Perl perspective. I let him in on what I was doing, and he opined I should be using pcre. So now I'm second-guessing my choices. Why do people choose not to use regex for URI parsing?

As the grammar becomes more complex (i.e., as your patterns become more nuanced), using a real parser framework helps to improve code legibility, since you can factor parts of the grammar out, give them names, etc. In addition to the documentation effects, this refactoring also allows you to make your grammars modular by using the same subgrammar in multiple places. While technically you can do the same factoring when constructing the regex that gets handed off to pcre, almost no one does that in practice.

Also, using a real parsing framework allows you to construct more powerful grammars than regular grammars, so if you need the power of unbounded recursion or of context sensitivity, then regular expressions are out. Technically Perl's regexen are Turing complete and aren't "regular expressions" at all; pcre has inherited some of that extra power, but the point still holds at large.

Even with more restricted regexen than Perl has, the modern idea of a "regex" isn't regular at all. Beginning of sentence and end of sentence anchors are not regular properties, which allows you to have the worst kind of fun :)

    http://zmievski.org/2010/08/the-prime-that-wasnt

Even if you did decide to go for regular expressions, pcre chooses a specific implementation for handling choice (namely backtracking search). Depending on your grammars and the text they'll be applied to, this may not be the most efficient implementation, since backtracking can lead to exponential behaviors that other regex implementations don't have.

Also, regexes are apparently very difficult to implement *correctly*:

    http://www.haskell.org/haskellwiki/Regex_Posix

-- 
Live well,
~wren
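
To make the "factor the grammar out and give the parts names" point concrete, here is a minimal sketch in Python. It is not any particular parser library: the helpers `pattern`, `seq`, `many` and `optional` are hypothetical names made up for this illustration, and the URI grammar is deliberately simplified (it is not RFC 3986). The thing to notice is that `word` and `pair` are each defined once and reused in several places.

```python
import re
from typing import Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position),
# or None when it does not match at that position.
Parser = Callable[[str, int], Optional[Tuple[object, int]]]

def pattern(regex: str) -> Parser:
    """Leaf parser: match one small regex at the current position."""
    compiled = re.compile(regex)
    def parse(text, pos):
        m = compiled.match(text, pos)
        return (m.group(0), m.end()) if m else None
    return parse

def seq(*parsers: Parser) -> Parser:
    """Run parsers one after another; fail as soon as one of them fails."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def many(parser: Parser) -> Parser:
    """Zero or more repetitions (the inner parser must consume input)."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parse

def optional(parser: Parser) -> Parser:
    """Yield None without consuming input if the inner parser fails."""
    def parse(text, pos):
        result = parser(text, pos)
        return result if result is not None else (None, pos)
    return parse

# The grammar itself: every piece has a name, and pieces are reused.
scheme  = pattern(r"[A-Za-z][A-Za-z0-9+.-]*")
word    = pattern(r"[A-Za-z0-9._~-]+")   # host, path segments, keys, values
port    = seq(pattern(r":"), pattern(r"[0-9]+"))
segment = seq(pattern(r"/"), word)
pair    = seq(word, pattern(r"="), word)
query   = seq(pattern(r"\?"), pair, many(seq(pattern(r"&"), pair)))
uri     = seq(scheme, pattern(r"://"), word,
              optional(port), many(segment), optional(query))

def parse_uri(text: str):
    """Return a dict of URI parts, or None if the whole text does not parse."""
    result = uri(text, 0)
    if result is None or result[1] != len(text):
        return None
    sch, _, host, port_part, segments, query_part = result[0]
    pairs = {}
    if query_part is not None:
        _, first, rest = query_part
        for key, _, value in [first] + [p for _, p in rest]:
            pairs[key] = value
    return {
        "scheme": sch,
        "host": host,
        "port": int(port_part[1]) if port_part is not None else None,
        "path": [name for _, name in segments],
        "query": pairs,
    }
```

Calling `parse_uri("http://example.com:8080/a/b?x=1&y=2")` returns a dict with the scheme, host, port, path segments and query pairs separated out, while input that does not fit the grammar returns `None`. The same language could be written as one long regex; the difference is only that here each subgrammar can be read, tested and reused on its own.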