
Hello all, For my mud client Yogurt (see hackage) I'm currently working on improving the efficiency of the hooks. Right now several hooks, each consisting of a regex and an action can be active at the same time. Every time a line of input is available (usually several times a second) I run the line through all the available regexes and execute the first matching action. I figured this is not the cleverest approach and it'd be better if I |'ed all regexes into one big DFA. However, how do I then find out which of the original hooks matched and so which action to execute? As far as I know there's no way to do that with Text.Regex. Alex looks promising but is really only an executable and doesn't offer an API. I've also found mr. João Saraiva's HaLex but I don't know if that was meant to be used seriously. Does anyone have any experience with this? What's the best way to achieve this? Thanks much, Martijn.

Martijn van Steenbergen
Does anyone have any experience with this? What's the best way to achieve this?
For anything remotely connected to parsing, always use parsec. I'd not be surprised if the beast is touring complete in itself... or can parse something that can parse itself, which'd be a reverse quine or something. (I'm going to shut up already) -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

On Tue, 2008-11-04 at 18:26 +0100, Achim Schneider wrote:
Martijn van Steenbergen
wrote: Does anyone have any experience with this? What's the best way to achieve this?
For anything remotely connected to parsing, always use parsec.
I'd not be surprised if the beast is touring complete in itself...
This is an interesting question: what would constitute Turing-completeness for an EDSL, independently of the host language? jcc

On Tue, Nov 4, 2008 at 9:26 AM, Achim Schneider
Martijn van Steenbergen
wrote: For anything remotely connected to parsing, always use parsec. I'd not be surprised if the beast is touring complete in itself...
Actually, this can count against you. It's very easy to use Parsec to build an innocent looking grammar that's too slow to use because it'll do all kinds of backtracking to find a way to make your input fit the grammar. I recommend Parsec for lots of tasks, but take care to design the grammar so it doesn't take exponential time to do anything. -- Dan

"Dan Piponi"
On Tue, Nov 4, 2008 at 9:26 AM, Achim Schneider
wrote: Martijn van Steenbergen
wrote: For anything remotely connected to parsing, always use parsec. I'd not be surprised if the beast is touring complete in itself...
Actually, this can count against you. It's very easy to use Parsec to build an innocent looking grammar that's too slow to use because it'll do all kinds of backtracking to find a way to make your input fit the grammar. I recommend Parsec for lots of tasks, but take care to design the grammar so it doesn't take exponential time to do anything.
Considering that he's talking about a mud, I figure the grammar is a quite straightforward command = l[eft] | r[ight] | ... | t[ake] <item> | c[ast] <spell> That is, I'd be very surprised if you even need more than two or three characters lookahead, much less backtracking. Parsec is a thousand times more efficient for such things than regular expressions, and you can just lazily parse all the input into a list of data constructors and happily fold it into your state... The only thing more straightforward than this is reading a xml with HaXML (if you have a DTD, that is) -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

On Tue, Nov 04, 2008 at 08:34:37PM +0100, Achim Schneider wrote:
Parsec is a thousand times more efficient for such things than regular expressions, and you can just lazily parse all the input into a list of data constructors and happily fold it into your state...
I would recommend you to use parsec too; my experience suggests that parsec often results in much cleaner and readable implementation than obscure regexps. -- pierre

G'day all.
Quoting Achim Schneider
Considering that he's talking about a mud, I figure the grammar is a quite straightforward
command = l[eft] | r[ight] | ... | t[ake] <item> | c[ast] <spell>
That is, I'd be very surprised if you even need more than two or three characters lookahead, much less backtracking.
In the case of a command followed by arguments, it would make more sense to use a keyword recogniser followed by a command-specific parser. One suggestion follows. Cheers, Andrew Bromage --------8<---CUT HERE---8<-------- module KeywordMatch (keywordMatch) where import Data.List import Data.Function import Control.Arrow -- Exercise: Why would it be wrong to curry this function? keywordMatch :: (Ord k) => [([k],v)] -> [k] -> Maybe v keywordMatch kvs = compileTrie . generateTrie . sortBy (compare `on` fst) $ kvs data Trie k v = Trie (Maybe v) (Trie' k v) data Trie' k v = Node0 | Node1 k (Trie k v) | Node2 k (Trie k v) k (Trie k v) | Branch k (Trie' k v) (Trie k v) (Trie' k v) generateTrie :: (Ord k) => [([k],v)] -> Trie k v generateTrie (([],v):rest) = Trie (Just v) (generateTrie' rest) generateTrie rest = Trie Nothing (generateTrie' rest) generateTrie' :: (Ord k) => [([k],v)] -> Trie' k v generateTrie' [] = Node0 generateTrie' [(k:ks,v)] = Node1 k $ foldr (\k -> Trie Nothing . Node1 k) (Trie (Just v) Node0) ks generateTrie' [(k1:ks1,v1),(k2:ks2,v2)] = Node2 k1 (generateTrie [(ks1,v1)]) k2 (generateTrie [(ks2,v2)]) generateTrie' kvs = gt . map (head.fst.head &&& map (first tail)) . groupBy ((==) `on` head.fst) $ kvs where gt [] = Node0 gt [(k,kvs)] = Node1 k (generateTrie kvs) gt [(k1,kvs1),(k2,kvs2)] = Node2 k1 (generateTrie kvs1) k2 (generateTrie kvs2) gt kvs = let (l,(k,m):r) = splitAt (length kvs `div` 2) kvs in Branch k (gt l) (generateTrie m) (gt r) compileTrie :: (Ord k) => Trie k v -> [k] -> Maybe v compileTrie (Trie emptyCase trie') = let ctrie' = compileTrie' trie' in \key -> case key of [] -> emptyCase (k:ks) -> ctrie' k ks compileTrie' :: (Ord k) => Trie' k v -> k -> [k] -> Maybe v compileTrie' Node0 = \k ks -> Nothing compileTrie' (Node1 k' t) = let t' = compileTrie t in \k ks -> if k == k' then t' ks else Nothing compileTrie' (Node2 k1 t1 k2 t2) = let t1' = compileTrie t1 t2' = compileTrie t2 in \k ks -> if k == k1 then t1' ks else if k == k2 then t2' ks else Nothing compileTrie' (Branch k' l m r) = let cl = compileTrie' l cm = compileTrie m cr = compileTrie' r in \k ks -> case compare k k' of LT -> cl k ks EQ -> cm ks GT -> cr k ks -- vim: ts=4:sts=4:expandtab

G'day all.
Quoting Achim Schneider
Considering that he's talking about a mud, I figure the grammar is a quite straightforward
command = l[eft] | r[ight] | ... | t[ake] <item> | c[ast] <spell>
That is, I'd be very surprised if you even need more than two or three characters lookahead, much less backtracking.
In the case of a command followed by arguments, it would make more sense to use a keyword recogniser followed by a command-specific parser. One suggestion follows. Cheers, Andrew Bromage --------8<---CUT HERE---8<-------- module KeywordMatch (keywordMatch) where import Data.List import Data.Function import Control.Arrow -- Exercise: Why would it be wrong to curry this function? keywordMatch :: (Ord k) => [([k],v)] -> [k] -> Maybe v keywordMatch kvs = compileTrie . generateTrie . sortBy (compare `on` fst) $ kvs data Trie k v = Trie (Maybe v) (Trie' k v) data Trie' k v = Node0 | Node1 k (Trie k v) | Node2 k (Trie k v) k (Trie k v) | Branch k (Trie' k v) (Trie k v) (Trie' k v) generateTrie :: (Ord k) => [([k],v)] -> Trie k v generateTrie (([],v):rest) = Trie (Just v) (generateTrie' rest) generateTrie rest = Trie Nothing (generateTrie' rest) generateTrie' :: (Ord k) => [([k],v)] -> Trie' k v generateTrie' [] = Node0 generateTrie' [(k:ks,v)] = Node1 k $ foldr (\k -> Trie Nothing . Node1 k) (Trie (Just v) Node0) ks generateTrie' [(k1:ks1,v1),(k2:ks2,v2)] = Node2 k1 (generateTrie [(ks1,v1)]) k2 (generateTrie [(ks2,v2)]) generateTrie' kvs = gt . map (head.fst.head &&& map (first tail)) . groupBy ((==) `on` head.fst) $ kvs where gt [] = Node0 gt [(k,kvs)] = Node1 k (generateTrie kvs) gt [(k1,kvs1),(k2,kvs2)] = Node2 k1 (generateTrie kvs1) k2 (generateTrie kvs2) gt kvs = let (l,(k,m):r) = splitAt (length kvs `div` 2) kvs in Branch k (gt l) (generateTrie m) (gt r) compileTrie :: (Ord k) => Trie k v -> [k] -> Maybe v compileTrie (Trie emptyCase trie') = let ctrie' = compileTrie' trie' in \key -> case key of [] -> emptyCase (k:ks) -> ctrie' k ks compileTrie' :: (Ord k) => Trie' k v -> k -> [k] -> Maybe v compileTrie' Node0 = \k ks -> Nothing compileTrie' (Node1 k' t) = let t' = compileTrie t in \k ks -> if k == k' then t' ks else Nothing compileTrie' (Node2 k1 t1 k2 t2) = let t1' = compileTrie t1 t2' = compileTrie t2 in \k ks -> if k == k1 then t1' ks else if k == k2 then t2' ks else Nothing compileTrie' (Branch k' l m r) = let cl = compileTrie' l cm = compileTrie m cr = compileTrie' r in \k ks -> case compare k k' of LT -> cl k ks EQ -> cm ks GT -> cr k ks -- vim: ts=4:sts=4:expandtab

Excerpts from ajb's message of Wed Nov 05 03:59:03 +0100 2008:
G'day all.
Hi,
Quoting Achim Schneider
: Considering that he's talking about a mud, I figure the grammar is a quite straightforward
command = l[eft] | r[ight] | ... | t[ake] <item> | c[ast] <spell>
That is, I'd be very surprised if you even need more than two or three characters lookahead, much less backtracking.
In the case of a command followed by arguments, it would make more sense to use a keyword recogniser followed by a command-specific parser.
One suggestion follows.
Oops there is a bug in there: GHCI> keywordMatch [("a", 1), ("aa", 2)] "aa" Nothing The third equation of generateTrie' is missing a guard, namely k1 /= k2. generateTrie' [(k1:ks1,v1),(k2:ks2,v2)] | k1 /= k2 = Node2 k1 (generateTrie [(ks1,v1)]) k2 (generateTrie [(ks2,v2)]) Best regards, -- Nicolas Pouillard aka Ertai

On Tue, 2008-11-04 at 10:02 -0800, Dan Piponi wrote:
On Tue, Nov 4, 2008 at 9:26 AM, Achim Schneider
wrote: Martijn van Steenbergen
wrote: For anything remotely connected to parsing, always use parsec. I'd not be surprised if the beast is touring complete in itself...
Actually, this can count against you. It's very easy to use Parsec to build an innocent looking grammar that's too slow to use because it'll do all kinds of backtracking to find a way to make your input fit the grammar. I recommend Parsec for lots of tasks, but take care to design the grammar so it doesn't take exponential time to do anything.
Backtracking points are explicit in Parsec which is one of the whole points of it. This makes it pretty difficult to "innocently" end up with exponential behavior. Backtracking requires a use of the 'try' combinator. It's pretty easy to recognize potentially dangerous uses of 'try'. If you use 'try' willy-nilly or you just throw 'try' in when you are having difficulties then I can quite easily imagine one quickly ending up with exponential behavior. Otherwise, it should not be "easy" to do and if you like you can not use 'try' (or 'try' using combinators) at all and you surely won't get exponential behavior (as far as Parsec is concerned).

Hi Martijn, It's not that tricky if you do a regular expression state machine yourself, but that's probably a bit too much work. One way to get a speed up might be to take the regular expressions a,b,c,d and generate a regex a+b+c+d, and one a+b. You can then check any string s against a+b+c+d, if that matches check a+b, if that matches check a. At each stage you eliminate half the regular expressions, which means a match will take log n, where n is the number of regular expressions. This assumes the underlying regular expression engine constructs a finite state machine, making it O(m) where m is the length of the string to match. Thanks Neil
-----Original Message----- From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Martijn van Steenbergen Sent: 04 November 2008 5:05 pm To: Haskell Cafe Subject: [Haskell-cafe] Efficient parallel regular expressions
Hello all,
For my mud client Yogurt (see hackage) I'm currently working on improving the efficiency of the hooks. Right now several hooks, each consisting of a regex and an action can be active at the same time. Every time a line of input is available (usually several times a second) I run the line through all the available regexes and execute the first matching action.
I figured this is not the cleverest approach and it'd be better if I |'ed all regexes into one big DFA. However, how do I then find out which of the original hooks matched and so which action to execute?
As far as I know there's no way to do that with Text.Regex. Alex looks promising but is really only an executable and doesn't offer an API. I've also found mr. João Saraiva's HaLex but I don't know if that was meant to be used seriously.
Does anyone have any experience with this? What's the best way to achieve this?
Thanks much,
Martijn.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html ==============================================================================

Hello everyone, Thank you all for your comments! Those are some very useful ideas. I think I'll try roger's (private) and ChrisK's suggestion first: using the match groups. I'm not sure if the match groups inside the individual regexes will cause much trouble, but we'll see. I imagine I'll have to count parentheses, except when it's followed by a \, except when that \ follows another \, etc. There's probably other situations where a () doesn't count as a group, perhaps when it's followed by a * or +. I'll look into that. If that doesn't work out I'll go for Neil's (from an algorithmic POV beautiful) suggestion. While I understand that some of you suggest I use parsec (or some other mature parser library) I'm pretty sure that's not what I want here. The patterns will almost always be very simple and regular expressions offer an extremely concise way of expressing when a hook should fire. Forcing the user to use full parsers would cause the programs to become much more verbose. Still, Yogurt is flexible enough to allow the user to use parsec if he or she so chooses. Thanks again, Martijn. Mitchell, Neil wrote:
Hi Martijn,
It's not that tricky if you do a regular expression state machine yourself, but that's probably a bit too much work. One way to get a speed up might be to take the regular expressions a,b,c,d and generate a regex a+b+c+d, and one a+b. You can then check any string s against a+b+c+d, if that matches check a+b, if that matches check a. At each stage you eliminate half the regular expressions, which means a match will take log n, where n is the number of regular expressions.
This assumes the underlying regular expression engine constructs a finite state machine, making it O(m) where m is the length of the string to match.
Thanks
Neil

"roger peppe"
On Wed, Nov 5, 2008 at 1:56 PM, Martijn van Steenbergen
wrote: I think I'll try roger's (private)
eek, bitten by "reply to sender only" again!
i had intended to send to the list too.
I recommend using a newsreader and connecting it to gmane, you won't ever have that problem there. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Mitchell, Neil wrote:
Hi Martijn,
It's not that tricky if you do a regular expression state machine yourself, but that's probably a bit too much work. One way to get a speed up might be to take the regular expressions a,b,c,d and generate a regex a+b+c+d, and one a+b. You can then check any string s against a+b+c+d, if that matches check a+b, if that matches check a. At each stage you eliminate half the regular expressions, which means a match will take log n, where n is the number of regular expressions.
This assumes the underlying regular expression engine constructs a finite state machine, making it O(m) where m is the length of the string to match.
If you're implementing the machine yourself, you can implement it as trie automaton which has some "value" associated with each final state. That is, rather than just accepting or rejecting, when the automaton accepts it returns the value associated with the particular final state that it accepted by. This is a trivial extension on DFA/NFA implementations and is perfectly suited to the problem of combining multiple linear tries (sic: regexes) into a single machine. The minimization algorithms are a bit more complex than for DFA/NFAs, but still fairly straightforward if you're only doing prefix merging. And this gets rid of the O(log n) factor. -- Live well, ~wren

The regex-tdfa package (and regex-posix) implement subexpressions capture. So if you want to match alpha beta and gamma in parallel you could write "(alpha)|(beta)|(gamma)" and check which subexpression has the non-empty match. This becomes slightly complicated if there are parenthesis and captures inside alpha beta or gamma. Then you need to compute the indices that are the top level captures. In particular, the regex-tdfa package (get the latest from http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa ) will create a DFA and run through the input once without backtracking. It will find the leftmost-longest match, so the order of the branches only matters if there is a tie in length. If you need to be left-biased then you need a perl-style engine, and you can use the regex-pcre or pcre-light haskell package and the PCRE library. These are obtainable from Hackage. I doubt PCRE uses a simple DFA... Cheers, Chris Martijn van Steenbergen wrote:
Hello all,
For my mud client Yogurt (see hackage) I'm currently working on improving the efficiency of the hooks. Right now several hooks, each consisting of a regex and an action can be active at the same time. Every time a line of input is available (usually several times a second) I run the line through all the available regexes and execute the first matching action.
I figured this is not the cleverest approach and it'd be better if I |'ed all regexes into one big DFA. However, how do I then find out which of the original hooks matched and so which action to execute?
As far as I know there's no way to do that with Text.Regex. Alex looks promising but is really only an executable and doesn't offer an API. I've also found mr. Jo�o Saraiva's HaLex but I don't know if that was meant to be used seriously.
Does anyone have any experience with this? What's the best way to achieve this?
Thanks much,
Martijn.

ChrisK wrote:
If you need to be left-biased then you need a perl-style engine, and you can use the regex-pcre or pcre-light haskell package and the PCRE library. These are obtainable from Hackage. I doubt PCRE uses a simple DFA...
I don't know if regex-pcre or pcre-light supports the (?{...})-ism of Perl 5.6 and above, but if it does then it's fairly easy to implement the trie automaton I mentioned in my previous post: just add a (?{ $value = ... }) action to the end of each component regex and read out the value of $value after you match. You'll still want to run the automata through a minimizer/optimizer after gluing all the components together, otherwise you still get O(n) behavior since you don't share the work for common prefixes. -- Live well, ~wren

Hello wren, Thursday, November 6, 2008, 12:00:22 PM, you wrote:
the trie automaton I mentioned in my previous post: just add a (?{ $value = ... }) action to the end of each component regex and read out the value of $value after you match.
$value? in haskell? :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello wren,
Thursday, November 6, 2008, 12:00:22 PM, you wrote:
the trie automaton I mentioned in my previous post: just add a (?{ $value = ... }) action to the end of each component regex and read out the value of $value after you match.
$value? in haskell? :)
Shh, they'll notice :) -- "callCC implementation left as an exercise for the reader" ~wren

On Tue, Nov 4, 2008 at 9:05 AM, Martijn van Steenbergen
For my mud client Yogurt (see hackage) I'm currently working on improving the efficiency of the hooks. Right now several hooks, each consisting of a regex and an action can be active at the same time. Every time a line of input is available (usually several times a second) I run the line through all the available regexes and execute the first matching action.
I figured this is not the cleverest approach and it'd be better if I |'ed all regexes into one big DFA. However, how do I then find out which of the original hooks matched and so which action to execute?
Is this really a problem in practice? I've done similar things in Ruby, which is a much slower language, and not had any issues - particularly in something IO bound like a MUD client it doesn't seem that running down a few tens of regexps would be a bottleneck of any sort. martin

Hi Martijn,
If you are brave to start implementing DFA with all required
optimisations then you might want to look at:
http://www.ontotext.com/gate/japec.html
This is a compiler for language called JAPE. In the language you
define a set of rules where the right hand side
is a regular expression and the left hand side is a Java code. The
compiler itself is implemented in Haskell.
It includes code to build DFA from the set of regexps and then it does
determinization and minimization.
I wrote the compiler few years ago. You can decide to take and change
the code or to reimplement it yourself. Definitely DFA guarantees that
the performance is always linear while with Parsec you have to be
careful.
Regards,
Krasimir
On Tue, Nov 4, 2008 at 6:05 PM, Martijn van Steenbergen
Hello all,
For my mud client Yogurt (see hackage) I'm currently working on improving the efficiency of the hooks. Right now several hooks, each consisting of a regex and an action can be active at the same time. Every time a line of input is available (usually several times a second) I run the line through all the available regexes and execute the first matching action.
I figured this is not the cleverest approach and it'd be better if I |'ed all regexes into one big DFA. However, how do I then find out which of the original hooks matched and so which action to execute?
As far as I know there's no way to do that with Text.Regex. Alex looks promising but is really only an executable and doesn't offer an API. I've also found mr. João Saraiva's HaLex but I don't know if that was meant to be used seriously.
Does anyone have any experience with this? What's the best way to achieve this?
Thanks much,
Martijn.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

using strings (inside a program) to represent structured data is wrong (*). of course you need strings for interfacing the "outside" world, but the microsecond they get on the inside, they should be tokenized and parsed away into something useful (= an abstract syntax tree). (*) corollary: using strings to represent regular expressions is also wrong...

Johannes Waldmann schrieb:
using strings (inside a program) to represent structured data is wrong (*).
of course you need strings for interfacing the "outside" world, but the microsecond they get on the inside, they should be tokenized and parsed away into something useful (= an abstract syntax tree).
(*) corollary: using strings to represent regular expressions is also wrong...
I consider these regular expression strings an embedded domain-specific language. It seems to me, that putting regexps into strings was a work-around, because you could not extend Haskell's syntax. But now things change with this new GHC feature - what was its name? Nevertheless, I never used regexps in Haskell programs, parsec is much nicer.
participants (18)
-
Achim Schneider
-
ajb@spamcop.net
-
Bulat Ziganshin
-
ChrisK
-
Dan Piponi
-
Derek Elkins
-
Henning Thielemann
-
Johannes Waldmann
-
Jonathan Cast
-
Krasimir Angelov
-
Martijn van Steenbergen
-
Martin DeMello
-
Mitchell, Neil
-
Nicolas Pouillard
-
pierre
-
Richard O'Keefe
-
roger peppe
-
wren ng thornton