"Tying the knot" with unknown keys

I was playing with some code for compiling regular expressions to finite state machines and I ran into the following problem. I've solved it, but I'm not terribly happy with my solution and was wondering if someone could come up with a better one. :-)

Essentially I have

data FSM = State { transitions :: (Map Char FSM) }

and

transitions' :: Regexp -> Map Char Regexp

I want to lift this so that the Regexps become states of the finite state machine (while making sure I set up a loop in the data structure). Tying the knot is the traditional way of doing such things, but we couldn't figure out a way to make it work without the set of keys known in advance because of the strictness of Map in its keys (an association list was suggested, and that would probably work, but it seemed a bit ugly and would be fairly inefficient).

In the end what I did was just work out the set of reachable regexps in advance and use a standard tying-the-knot trick, but it felt vaguely unsatisfactory (and does some repeated work which I felt should be unnecessary). Anyone have a more elegant suggestion?

Regards,
David
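For concreteness, here is a minimal sketch of the "work out the reachable regexps in advance, then tie the knot" approach described above (compileWith, next, and the generic state type s are illustrative names, not the actual code):

import Data.Map (Map)
import qualified Data.Map as Map

data FSM = State { transitions :: (Map Char FSM) }

-- 'next' plays the role of transitions', 'states' is the precomputed list of
-- reachable states (e.g. regexps) and must include 'start' and everything
-- reachable from it.
compileWith :: Ord s => (s -> Map Char s) -> [s] -> s -> FSM
compileWith next states start = table Map.! start
  where
    -- The knot: every value stored in 'table' refers back to 'table' itself.
    -- This works because the key set is fully known up front; Map is strict
    -- in its keys but lazy in its values.
    table = Map.fromList
        [ (s, State (Map.map (table Map.!) (next s))) | s <- states ]

-- e.g.  compile r = compileWith transitions' (reachableFrom r) r
-- where 'reachableFrom' stands in for whatever precomputes the reachable regexps.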

David Ritchie MacIver wrote:
I was playing with some code for compiling regular expressions to finite state machines and I ran into the following problem. I've solved it, but I'm not terribly happy with my solution and was wondering if someone could come up with a better one. :-)
Essentially I have
data FSM = State { transitions :: (Map Char FSM) }
and
transitions' :: Regexp -> Map Char Regexp
I want to lift this so that the Regexps become states of the finite state machine (while making sure I set up a loop in the data structure). Tying the knot is the traditional way of doing such things, but we couldn't figure out a way to make it work without the set of keys known in advance because of the strictness of Map in its keys (an association list was suggested, and that would probably work, but it seemed a bit ugly and would be fairly inefficient).
In the end what I did was just work out the set of reachable regexps in advance and use a standard tying-the-knot trick, but it felt vaguely unsatisfactory (and does some repeated work which I felt should be unnecessary). Anyone have a more elegant suggestion?
Hmm. I tried and came up with this:
import Data.Maybe
import Data.Map (Map)
import qualified Data.Map as M
data Graph b = Graph (Map b (Graph b))
buildTransitionGraph :: (Ord a, Ord b) => (a -> Map b a) -> a -> Graph b
buildTransitionGraph f i = fromJust $ i `M.lookup` build M.empty [i]
  where
    -- build :: Map a (Graph b) -> [a] -> Map a (Graph b)
    build g [] = g
    build g (a:as) = g''
      where
        -- g'' :: Map a (Graph b)
        g'' = build g' as'
        (as', g') = foldr step (as, g) (M.toList (f a))
        step (l, n) (as, g)
          | M.member n g = (as, g)
          | otherwise    = (n:as, M.insert n (f' n) g)
        -- f' :: a -> Graph b
        f' = Graph . M.map (fromJust . (`M.lookup` g'')) . f
which couples the knot tying with finding the reachable states. 'build' takes a map from the states seen so far to their corresponding 'Graph' nodes, plus a working stack of states not yet processed, and processes a single state. 'step' processes a single transition: if it leads to an unknown state, that state is added to the seen-state map and pushed onto the work stack. The knot is tied between the final result of the calculation, g'', and the map that is being built; this happens in f'.

Test:
t :: Int -> Map Int Int
t 1 = M.fromList [(1,2),(2,1)]
t 2 = M.fromList [(1,2),(3,1),(4,3)]
t _ = M.empty
traces :: Ord b => Int -> Graph b -> [[b]]
traces 0 g = [[]]
traces d (Graph g) = concat [map (n:) (traces (d-1) g') | (n, g') <- M.toList g]
*Main> traces 1 $ buildTransitionGraph t 1
[[1],[2]]
*Main> traces 1 $ buildTransitionGraph t 2
[[1],[3],[4]]
*Main> traces 2 $ buildTransitionGraph t 1
[[1,1],[1,3],[1,4],[2,1],[2,2]]

It's still not lazy, though. The potential lookups of states that haven't been seen yet make this hard to accomplish, although it should be possible with an unbalanced search tree and some clever use of irrefutable patterns.

enjoy,
Bertram

On 8/20/07, David Ritchie MacIver wrote:
I was playing with some code for compiling regular expressions to finite state machines and I ran into the following problem.
I've met exactly the same problem myself and you got me interested in it again.

I think the tricky part isn't so much the knot-tying, but the fact that you need a high performance Map-like datastructure that doesn't die the way Data.Map.fromList would if you gave it an infinite list as argument. One approach might be to replace Map k a with something like a

data UltraLazyMap k a = ULM (Map k a) [(k,a)]

The idea is that the Map part is built only as needed and the list part represents the elements not yet inserted into the tree. When you come to perform a lookup you first look in the Map part. If you don't find what you want there you start looking through the list (assuming that when you come to do lookups, every key you need eventually appears at least once in the list). Each time you look at a list element you remove it from the list and insert it into the tree. That way you never try to build an "infinite" tree and instead grow it as needed. This would have a similar amortised performance as a regular Map, but the price is that lookups change the structure and so you need mutable state. But that's OK, you just stick all of your code in a State monad.

I don't know if that State monad would ultimately mess up any attempt to eventually tie your knots, and I probably won't have time to try coding this up at least until the weekend. So take all of this with a pinch of salt :-)

Good luck! :-)

And if that doesn't work, I also have another approach I'm thinking about...
--
Dan
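A rough sketch of what this might look like (the names, the use of Control.Monad.State, and the error case below are illustrative choices, not a worked-out implementation of the idea):

import qualified Data.Map as M
import Data.Map (Map)
import Control.Monad.State (State, state)

-- The Map part is grown on demand from the (possibly infinite) association list.
data UltraLazyMap k a = ULM (Map k a) [(k, a)]

fromAssocs :: [(k, a)] -> UltraLazyMap k a
fromAssocs = ULM M.empty

-- Lookups mutate the structure (elements migrate from the list into the Map),
-- so they live in a State monad.
lookupULM :: Ord k => k -> State (UltraLazyMap k a) a
lookupULM k = state go
  where
    go (ULM m xs) =
        case M.lookup k m of
            Just v  -> (v, ULM m xs)
            Nothing -> scan m xs
    -- Scan the list, migrating everything we pass over into the Map,
    -- until the key turns up (we assume it eventually does).
    scan m ((k', v) : rest)
        | k' == k   = (v, ULM (M.insert k' v m) rest)
        | otherwise = scan (M.insert k' v m) rest
    scan _ [] = error "lookupULM: key not in map or list"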

On Mon, Aug 20, 2007 at 03:39:28PM -0700, Dan Piponi wrote:
On 8/20/07, David Ritchie MacIver wrote:
I was playing with some code for compiling regular expressions to finite state machines and I ran into the following problem.
I've met exactly the same problem myself and you got me interested in it again.
I think the tricky part isn't so much the knot-tying, but the fact that you need a high performance Map-like datastructure that doesn't die the way Data.Map.fromList would if you gave it an infinite list as argument.
One approach might be to replace Map k a with something like a
data UltraLazyMap k a = ULM (Map k a) [(k,a)]
The idea is that the Map part is built only as needed and the list part represents the elements not yet inserted into the tree. When you come to perform a lookup you first look in the Map part. If you don't find what you want there you start looking through the list (assuming that when you come to do lookups, every key you need eventually appears at least once in the list). Each time you look at a list element you remove it from the list and insert it into the tree. That way you never try to build an "infinite" tree and instead grow it as needed. This would have a similar amortised performance as a regular Map, but the price is that lookups change the structure and so you need mutable state. But that's OK, you just stick all of your code in a State monad.
I don't know if that State monad would ultimately mess up any attempt to eventually tie your knots, and I probably won't have time to try coding this up at least until the weekend. So take all of this with a pinch of salt :-)
Good luck! :-)
And if that doesn't work, I also have another approach I'm thinking about...
You could also just build the map lazily.
data Map k a = Fork k a (Map k a) (Map k a) | Leaf
...
insertMany (Fork k v l r) xs = Fork k v (insertMany l $ filter (
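The message is truncated at this point; a plausible, self-contained completion of the sketch (a reconstruction, not the original continuation) might be:

data Map k a = Fork k a (Map k a) (Map k a) | Leaf

-- Insert a (possibly infinite) list of bindings; the recursive calls are only
-- forced when a lookup descends into the corresponding subtree.
insertMany :: Ord k => Map k a -> [(k, a)] -> Map k a
insertMany Leaf []            = Leaf
insertMany Leaf ((k, v) : xs) = insertMany (Fork k v Leaf Leaf) xs
insertMany (Fork k v l r) xs  =
    Fork k v (insertMany l (filter ((< k) . fst) xs))
             (insertMany r (filter ((> k) . fst) xs))

lookupLazy :: Ord k => k -> Map k a -> Maybe a
lookupLazy _ Leaf = Nothing
lookupLazy x (Fork k v l r) = case compare x k of
    LT -> lookupLazy x l
    EQ -> Just v
    GT -> lookupLazy x r

-- Looking up a key that is present only forces a finite prefix of the list,
-- but looking up a key that never occurs in an infinite list will diverge.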

David Ritchie MacIver wrote:
Essentially I have
data FSM = State { transitions :: (Map Char FSM) }
and
transitions' :: Regexp -> Map Char Regexp
I want to lift this so that the Regexps become states of the finite state machine (while making sure I set up a loop in the data structure). Tying the knot is the traditional way of doing such things, but we couldn't figure out a way to make it work without the set of keys known in advance because of the strictness of Map in its keys (an association list was suggested, and that would probably work, but it seemed a bit ugly and would be fairly inefficient).
In the end what I did was just work out the set of reachable regexps in advance and use a standard tying-the-knot trick, but it felt vaguely unsatisfactory (and does some repeated work which I felt should be unnecessary). Anyone have a more elegant suggestion?
I have a solution that I like now, even though it involves quite a bit of code. Its core idea is very simple. The main ingredient is
data RegexpTrie a
which is a data type that represents an infinite trie-like structure, indexed by regular expressions. It comes with a lookup function,
lookupRE :: RegexpTrie a -> Regexp -> a
with the obvious semantics. It also provides a function to populate a trie,
populateRE :: (Regexp -> a) -> RegexpTrie a
With these functions we can build a map of *all* regular expressions to their corresponding FSM. This is where the knot-tying takes place:
fsm :: RegexpTrie FSM
fsm = populateRE (\re -> State { transitions = Map.map (lookupRE fsm) (transitions' re) })
Finally, 'compile' becomes a trivial lookup,
compile :: Regexp -> FSM
compile x = lookupRE fsm x
Detailed code can be found at http://hpaste.org/2341#a3

enjoy,
Bertram
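Purely as an illustration of the idea, a minimal generalized-trie sketch over a hypothetical, simplified Regexp type (this is not the code from the paste) could look like:

-- A hypothetical, simplified regexp type; the real one surely differs.
data Regexp = Eps | Lit Char | Seq Regexp Regexp | Alt Regexp Regexp | Star Regexp

-- A lazy, infinite trie indexed by Regexp: one field per constructor,
-- with nested tries for the recursive constructors.
data RegexpTrie a = RegexpTrie
    { tEps  :: a
    , tLit  :: Char -> a          -- a plain function is trie enough for Char here
    , tSeq  :: RegexpTrie (RegexpTrie a)
    , tAlt  :: RegexpTrie (RegexpTrie a)
    , tStar :: RegexpTrie a
    }

lookupRE :: RegexpTrie a -> Regexp -> a
lookupRE t Eps       = tEps t
lookupRE t (Lit c)   = tLit t c
lookupRE t (Seq r s) = lookupRE (lookupRE (tSeq t) r) s
lookupRE t (Alt r s) = lookupRE (lookupRE (tAlt t) r) s
lookupRE t (Star r)  = lookupRE (tStar t) r

-- Populate the whole (infinite) trie lazily from a function on regexps;
-- nothing is evaluated until lookupRE walks into it.
populateRE :: (Regexp -> a) -> RegexpTrie a
populateRE f = RegexpTrie
    { tEps  = f Eps
    , tLit  = f . Lit
    , tSeq  = populateRE (\r -> populateRE (f . Seq r))
    , tAlt  = populateRE (\r -> populateRE (f . Alt r))
    , tStar = populateRE (f . Star)
    }

With something like this, the 'fsm = populateRE ...' definition above never needs the set of reachable regexps in advance: each distinct regexp simply addresses its own slot in the infinite trie.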

G'day all.
Quoting David Ritchie MacIver
I was playing with some code for compiling regular expressions to finite state machines and I ran into the following problem. I've solved it, but I'm not terribly happy with my solution and was wondering if someone could come up with a better one. :-)
Doing structural induction is quite viable, as you noted.

However, there are advantages to implementing explicit indirection. The main one is memory usage. Explicit indirection is much less leak-prone, and you can easily share structures thanks to hash consing.

At any rate, I'm extremely curious as to how your code compares with mine, performance-wise:

http://www.ninebynine.org/Software/HaskellRDF/Dfa/Dfa.lhs

Cheers,
Andrew Bromage

ajb@spamcop.net wrote:
G'day all.
Quoting David Ritchie MacIver
I was playing with some code for compiling regular expressions to finite state machines and I ran into the following problem. I've solved it, but I'm not terribly happy with my solution and was wondering if someone could come up with a better one. :-)
Doing structural induction is quite viable, as you noted.
However, there are advantages to implementing explicit indirection. The main one is memory usage. Explicit indirection is much less leak-prone, and you can easily share structures thanks to hash consing.
At any rate, I'm extremely curious as to how your code compares with mine, performance-wise:
http://www.ninebynine.org/Software/HaskellRDF/Dfa/Dfa.lhs
Cheers, Andrew Bromage
I'd be astonished if yours didn't beat mine hands down. I'm just putting this together as an exercise to help me figure out some of the areas where my Haskell knowledge is extremely weak. It's totally unoptimised.

David Ritchie MacIver wrote:
I was playing with some code for compiling regular expressions to finite state machines and I ran into the following problem. I've solved it, but I'm not terribly happy with my solution and was wondering if someone could come up with a better one. :-)
Essentially I have
data FSM = State { transitions :: (Map Char FSM) }
and
transitions' :: Regexp -> Map Char Regexp
I want to lift this so that the Regexps become states of the finite state machine (while making sure I set up a loop in the data structure). Tying the knot is the traditional way of doing such things, but we couldn't figure out a way to make it work without the set of keys known in advance because of the strictness of Map in its keys (an association list was suggested, and that would probably work, but it seemed a bit ugly and would be fairly inefficient).
In the end what I did was just work out the set of reachable regexps in advance and use a standard tying-the-knot trick, but it felt vaguely unsatisfactory (and does some repeated work which I felt should be unnecessary). Anyone have a more elegant suggestion?
Regards, David
As others have pointed out, the decision to use Data.Map is a limiting issue. Another approach is the combinator method in CTK Light, http://www.cse.unsw.edu.au/~chak/haskell/ctk/ , which was specialized and enhanced for regexps in the regex-dfa package [1]. This lazily constructs a DFA from the regular expression.

The "tying the knot" happens in the definition of the 'star' combinator (the '*' regexp character, also implied by '+'). The problem with this elegant definition is that it fails badly if the pattern being repeated might succeed after consuming zero characters (it hangs in an infinite loop). Other than that it is a wonderful definition:
type Regexp = Lexer -> Lexer
-- star re1 re2 means repeat re1, followed by re2: "((re1)*)(re2)"
star :: Regexp -> Regexp -> Regexp
star re1 re2 = \l -> let self = re1 self >||< re2 l in self
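To make the knot in 'star' concrete, here is a toy, self-contained stand-in for the combinator machinery (the Lexer type, accept, char and >||< below are illustrative simplifications, not the actual CTK or regex-dfa definitions):

-- A toy continuation-style lexer: given the remaining input, return the
-- leftover input after every way of matching a prefix here.
newtype Lexer = Lexer { runLexer :: String -> [String] }

-- Alternation: try both branches.
(>||<) :: Lexer -> Lexer -> Lexer
Lexer f >||< Lexer g = Lexer (\s -> f s ++ g s)

-- Accept at this point and hand the rest of the input to the caller.
accept :: Lexer
accept = Lexer (\s -> [s])

type Regexp = Lexer -> Lexer

char :: Char -> Regexp
char c k = Lexer (\s -> case s of
                          (x:xs) | x == c -> runLexer k xs
                          _               -> [])

-- star re1 re2 = "((re1)*)(re2)": 'self' is defined in terms of itself.
star :: Regexp -> Regexp -> Regexp
star re1 re2 = \l -> let self = re1 self >||< re2 l in self

-- ghci> runLexer (star (char 'a') (char 'b') accept) "aab"
-- [""]    -- i.e. "aab" matches a*b with nothing left over
-- If re1 can succeed without consuming input, 'self' re-enters itself on the
-- same input and the search loops, which is the failure mode described above.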
My regex-tdfa package did something quite different, since it has to deal with a lot of extra complexity. It works in a few stages, in particular because it needs the NFA states to handle subexpression captures and because it has to handle anchors like ^ and $:

String of regexp => Parsec extended regexp parser => parse tree data type
parse tree => complicated analyzer (uses mdo) => smarter tree data type
smarter tree => my complicated assembly monad (uses mdo) => Array Int NFA

where a simplified description of NFA is something like

data NFA = NFA Int Trans
data Trans = Trans (Map Char (Set Int)) -- might lead to more than one NFA state

I could just as easily have made this

data Trans = Trans (Map Char (Set NFA))

or

data Trans = Trans (Map Char (Set Trans))

by doing a lazy lookup into the array, but it would then not have been as easy to make the DFA in the next step:

Array Int NFA => use of a Trie indexed by (Set Int) => DFA

where a simplified DFA is like

data DFA = DFA (Set Int) (Map Char DFA)

and the Trie means I can lazily look up any subset of NFA states and get their merged DFA state.

So the procedure starts with a simple empty winning NFA to the "right" of the parse tree. The regexp tree walk is done in a monad which provides the supply of unique Int indices when a new NFA state is created. The last NFA state to be created is the unique start state, which always gets the largest Int index.

The "tying the knot" trick in building the NFA was handled by walking the regexp parse tree where each node is attached to an NFA representing the future continuation from that node. The tricky case was the one that kills the simple "tying the knot" in CTK Light's method: when you have 'p*' and 'p' might match zero characters. The continuation needed to describe the future in that case had to be supplied in a more complicated form while walking 'p' to avoid the infinite looping. There are no mutable STRef/IORef variables. All the NFA nodes that are created during the monadic traversal are part of the final NFA (so there is no wasted work even though I make a single walk through the tree).

The resulting NFA is not as minimal as the differentiation method since my traversal does not look at whether characters in the regexp are equal (my NFA builder is equivalent to treating all the regexp characters as distinct). But this also means I do not have the combinatoric explosion of regexps that the differentiation method can produce. I kept improving the design until it reproduced the same kinds of NFA graphs I could produce manually under those assumptions. The typical NFA state represents the condition "you have just accepted character X in the regexp". This is different from the Thompson NFA, where states usually mean "you have just accepted a character leading to character X in the regexp".

The NFA that regex-tdfa produces
* has no empty transitions
* captures extended regexp POSIX semantics (difficult with empty matches)
* handles anchors such as ^ and $ properly
* efficiently handles inverted matches like [^a-z] with a default transition
* tracks parenthesized subexpressions for later capture
* can handle very tricky cases like {n,m} repetitions

Since it has no empty transitions, the NFA states are also the simplest states of the DFA. And the DFA states are just merged subsets of the NFA states, thus a simple Trie works wonders.
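As a rough illustration of that subset merging (simplified, hypothetical types, and no memoizing Trie, so identical subsets get rebuilt here instead of being shared as they are in regex-tdfa):

import Data.Array (Array, (!))
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Set (Set)
import qualified Data.Set as Set

data NFA = NFA Int (Map Char (Set Int))   -- a state index and its transitions
data DFA = DFA (Set Int) (Map Char DFA)   -- a DFA state is a merged set of NFA states

-- Each DFA state merges the per-character transitions of all the NFA states
-- in its set; laziness means successor states are only built when followed.
buildDFA :: Array Int NFA -> Set Int -> DFA
buildDFA nfas states = DFA states (Map.map (buildDFA nfas) merged)
  where
    merged :: Map Char (Set Int)
    merged = Map.unionsWith Set.union
               [ trans | i <- Set.toList states, let NFA _ trans = nfas ! i ]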
Advanced hints: things like ((^|a)?b*($|c)?)* are ugly. Getting POSIX patterns with {n,m} repetitions right was difficult, and QuickCheck found out that the Mac OS X regex.h actually contains a bug for some uses of {n,m} repetitions.

-- Chris

[1] http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-dfa-0.91
participants (6): ajb@spamcop.net, Bertram Felgenhauer, ChrisK, Dan Piponi, David Ritchie MacIver, Stefan O'Rear