
David Ritchie MacIver wrote:
I was playing with some code for compiling regular expressions to finite state machines and I ran into the following problem. I've solved it, but I'm not terribly happy with my solution and was wondering if someone could come up with a better one. :-)
Essentially I have
data FSM = State { transitions :: (Map Char FSM) }
and
transitions' :: Regexp -> Map Char Regexp
I want to lift this so that the Regexps become states of the finite state machine (while making sure I set up a loop in the data structure). Tying the knot is the traditional way of doing such things, but we couldn't figure out a way to make it work without the set of keys known in advance because of the strictness of Map in its keys (an association list was suggested, and that would probably work, but it seemed a bit ugly and would be fairly inefficient).
In the end what I did was just work out the set of reachable regexps in advance and use a standard tying the knot trick, but it felt vaguely unsatisfactory (and does some repeat work which I felt should be unneccessary). Anyone have a more elegant suggestion?
Regards, David
As others have pointed out, the decision to use Data.Map is a limiting issue. Another approach is the combinator method in CTK Light http://www.cse.unsw.edu.au/~chak/haskell/ctk/ which was specialized and enhanced for regexp in the regex-dfa package [1]. This lazily constructs a DFA from the regular expression. The "tying the knot" happens in the definition of the 'star' combinator (the '*' regexp character, also implied by '+'). The problem with this elegant definition is that it fails badly if the pattern being repeated might succeed after consuming zero characters (it hangs in an infinite loop). Other than that it is a wonderful definition:
type Regexp = Lexer -> Lexer
-- star re1 re2 means repeat re1 follow with re2: "((re1)*)(re2)" star :: Regexp -> Regexp -> Regexp star re1 re2 = \l -> let self = re1 self >||< re2 l in self
My regex-tdfa package did something quite different since it has to deal with a lot of extra complexity. It works in a few stages, in particular because it needs the NFA states to handle subexpression captures and because it has to handle anchors like ^ and $. String of regexp => Parsec extended regexp parser => parse tree data type parse tree => complicated analyzer (uses mdo) => smarter tree data type smarter tree => My complicated assembly monad (uses mdo) => Array Int NFA Where a simplified description of NFA is something like data NFA = NFA Int Trans data Trans = Trans (Map Char (Set Int))-- Might lead to more than one NFA state I could just as easily have made this data Trans = Trans (Map Char (Set NFA)) or data Trans = Trans (Map Char (Set Trans)) by doing a lazy lookup into the array, but it would then not have been as easy to make the DFA in the next step: Array Int NFA => Use of Trie indexed by (Set Int) => DFA Where a simplified DFA is like data DFA = DFA (Set Int) (Map Char DFA) and the Trie means I can lazily lookup any subset of NFA state and get their merge DFA state. So the procedure starts with a simple empty winning NFA to the "right" of the parse tree. The rexgexp tree walk is done in a monad which provides the supply of unique Int index when a new NFA state is created. The last NFA state to be created is the unique start state which always gets the largest Int index. The "tying the knot" trick in building the NFA was handled by walking the regexp parse tree where each node is attached to an NFA representing the future continuation from that node. The tricky case was the one that kills the simple "tying the knot" in CTK Light's method: when you have 'p*' and 'p' might match zero characters. The continuation needed to describe the future in that case had to be supplied in a more complicated form while walking 'p' to avoid the infinite looping. There are no mutable STRef/IORef variables. All the NFA nodes that are created during the monadic traversal are part of the final NFA (so there is no wasted work even though I make a single walk through the tree). The resulting NFA is not as minimal as the differentiation method since my traversal does not look at whether characters in the regexp are equal (my NFA builder is equivalent to treating all the regexp characters are distinct). But this also means I do not have the combinatoric explosion of regexps that the differentiation method can produce. I kept improving the design until it reproduced the same kinds of NFA graphs I could produce manually under those assumptions. The typical NFA state represents the condition "you have just accepted character X in the regexp". This is different from the Thompson NFA where states usually mean "you have just accepted a character leading to character X in the regexp". The NFA that regexp-tdfa produces * has no empty transitions * Captures extended regexp Posix semantics (difficult with empty matches) * handles anchors such as ^ and $ properly * efficiently handle inverted matches like [^a-z] with a default transition * Tracks parenthesized subexpression for later capture * Can handle very tricky cases like {n,m} repetitions Since it has no empty transitions the NFA states are also the simplest states of the DFA. And the DFA states are just merged subsets of the NFA states, thus a simple Trie works wonders. Advanced hints: Things like ((^|a)?b*($|c)?)* are ugly. Getting Posix patterns with {n,m} repetitions right was difficult -- and quick check found out that the Mac OS X regex.h actually contains a bug for some uses of {n,m} repetitions. -- Chris [1] http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-dfa-0.91