
Bulat Ziganshin wrote:
Hello Chris,
Wednesday, March 8, 2006, 1:26:37 AM, you wrote:
CK> It takes the string form of the regular expression and uses Parsec to create a
he-he, i written the same thing (but very simple) 2 years ago :) i planned to submit it to the Parsec developers as an example of double-Parsec usage :)
The proposed shootout version on the wiki ( http://haskell.org/hawiki/RegexDna#head-ac7a5b838757d66780247397221f3b4f1ace... ) uses p_regexp :: CharParser () (CharParser st () -> CharParser st ()) which is even more bizarre at first glance. But this is not what my full library uses.
i think that it is a great lib, but not sure that it should completely replace current lib. old lib is more appropriate for packed string, new lib work directly with Haskell strings
Exactly. If you have a packed ascii (or unicode?) string you should call c via regex.h or pcre.h to do the matching. But doing this with length 10^6 [Char] via Text.Regex is next to impossible. Thus the niche for the [Char] version I have created. Another nice thing is that the Parsec versions of matchRegex / matchRegexAll / subRegex / splitRegex are lazy, so you could substitute or split an infinite string. Last night "Igloo" on the #haskell shared a HUnit test suite he used for his personal version of basic and extended regular expression matching. This located two bugs and one specification error in my code (all now fixed). After some more testing I will be looking for a place to post it. Is there somewhere on www.haskell.org that would work?
one more interesting thing - generation of faster and simpler parsers for simple regexps. just as example, code from my own program, that parse filename wildcards. it translates simple patterns directly to the "String->Bool" functions and use Regex library for more complex patterns
Hmmm...Yes. Another String->Pattern parser (probably in Parsec) could transform filename wildcards. But that would lose information on the simplicity. I have not created the infrastructure for such alternatives or meta-data (such as "anchored only at start of string" or "only uses greedy operators" or "does not need back-references" or "can be reduced to a DFA").
-- |Compiled regexpr representation EXAMPLE data RegExpr = RE_End -- "" | RE_Anything -- "*" | RE_FromEnd RegExpr -- '*':"bc" | RE_AnyChar RegExpr -- '?':"bc" | RE_Char Char RegExpr -- 'a':"bc" | RE_FullRE Regex -- "r[0-9][0-9]"
My parsed form of the string Regex is the Pattern data type. It is not used to actually do matching (though that would be possible), but to later compile a Parsec parser. data Pattern = PEmpty | PCarat | PDollar | PFail String | PGroup PatternIndex Pattern | POr [Pattern] | PConcat [Pattern] | PQuest Pattern -- ? | PPlus Pattern -- + | PStar Pattern -- * | PBound Int (Maybe Int) Pattern -- {3} or {3,5} or {3,} -- PLazy indicates the pattern should try the shortest matches first | PLazy Pattern -- non-greedy wrapper (?+*{} followed by ?) -- PPossessive indicates the pattern can only fit the longest match | PPossessive Pattern -- possessive modifier (?+*{} followed by +) | PDot -- Any character (newline?) at all | PAny PatternSet -- Square bracketed things | PAnyNot PatternSet -- Inverted square bracketed things | PEscape Char -- Backslashed Character | PChar Char -- Specific Character -- After "simplify" adjacent PChar are merge'd into PString | PString String deriving (Eq,Show) Where PatternSet is usually just a (Set Char) and Set of [:alpha:] character classes. It also holds parsed [.ch.] and [=x=] expressions, but these are not really implemented in the matching. PGroup is an empty or non-empty group "()" or "(foo)" with the back reference index. Also note that Pattern does not differentiate different types of escaped characters (e.g. \a \* \4 are all PEscape patterns). The meaning of PDot varies depending on the options. Multiline expressions do not match PDot with '\n' and to agree with Text.Regex '\NUL' characters are disallowed in the string regex and not matched by PDot. PCarat and PDollar actions also depend on multiline, and case sensitivity affects the character/string matching. I ought to make another option to allow '\NUL' to be treated as a regular character. -- Chris Kuklewicz