Embedding Perl RegEx in Haskell

Dear Haskell folks, I was attempting to do an EDSL that would allow me to describe regular expressions in Hakell and generate Perl as target - https://github.com/ckkashyap/LearningPrograms/blob/master/Haskell/edsl/regex... $ ghci regex.hs GHCi, version 7.0.3: http://www.haskell.org/ghc/ :? for help Loading package ghc-prim ... linking ... done. Loading package integer-gmp ... linking ... done. Loading package base ... linking ... done. Loading package ffi-1.0 ... linking ... done. [1 of 1] Compiling Main ( regex.hs, interpreted ) Ok, modules loaded: Main. *Main> let hello = stringToRegularExpression "hello" *Main> let world = stringToRegularExpression "world" *Main> :t hello hello :: RegularExpression *Main> :t world world :: RegularExpression *Main> let re = listToSequence [ hello, oneOrMore world, hello ] *Main> re hello(world)+hello *Main> :t re re :: RegularExpression I am looking for suggestions on how I could encode the anchors - ^ and $ and also how I could refer to captures. My goal is to be able to extend the idea to create an EDSL that could be used to build Perl tasks - such as filters, report generators etc Arguably this could be seen as a pointless exercise because I could chose to do the scripting in Haskell itself but I have a practical problem that I have to deal with clusters that have a little dated version of Linux where I could not really build the Haskell and where Haskell generated executables don't seem to run. When I started out with this idea, I was thinking of embedding Perl in Haskell but later I thought that perhaps that will sort of beat the point of "DS" bit in EDSL. So, I have started on this idea of trying to encode very precise "tasks". Regards, Kashyap

On Thu, Aug 18, 2011 at 14:01, C K Kashyap
*Main> let re = listToSequence [ hello, oneOrMore world, hello ] *Main> re hello(world)+hello
I am looking for suggestions on how I could encode the anchors - ^ and $ and also how I could refer to captures. My goal is to be able to extend the idea to create an EDSL that could be used to build Perl tasks - such as filters, report generators etc
Why not just go with anchorHead and anchorTail or similar? And a capture could simply be capture name regularExpression where name can be an Int or support the newer named capture syntaxes. I'm not sure I would bother with fancy symbols; if anything, I might in your position go back to the old v8 regular expressions (or rather Henry Spencer's unencumbered reimplementation) and use the symbolic names from that; they were actually virtual machine opcodes. (Or even capture (Maybe name) regularExpression and (capture Nothing ...) could represent grouping. That's certainly how many actual RE implementations define it.) Arguably this could be seen as a pointless exercise because I could chose to
do the scripting in Haskell itself but I have a practical problem that I have to deal with clusters that have a little dated version of Linux where I could not really build the
(Don't even get me started on Linux backward incompatibility... I spent years fighting that crap, it's one thing I do not at all miss from my old job.) -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Sun, Aug 21, 2011 at 00:45, C K Kashyap
Why not just go with anchorHead and anchorTail or similar? And a capture could simply be
Thanks for your inputs Brandon. I've updated the code with anchors. Still trying to get a hold of captures.
What are you finding difficult about them? Backreferences? Given that this is Haskell, I'd look for ways to "tie the knot", although in terms of usability it's probably best to maintain a list of captures where the knot is actually tied between a capture regex component and its captures list entry, with backreferences either also tying the knot with the captures list or containing an offset in that list. Practical usage: you need to be able to access the strings matched by captures from "outside", so you want an independent structure for those captures rather than having to gmap the regex tree to get the captured strings. -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

By the way, if what you're actually looking for on a high level is a Haskell-like string matching engine, it might be better to go back to the original sources. Look up Kleene's work on string pattern matching; while regexes derive from it, its more direct descendants are the pattern matching mechanisms in SNOBOL and Icon. I've thought for a while that the right way to do it in Haskell is not parsing a program encoded as a string/regex, but instead a monadic "Kleene machine"; but I've never managed to work out a decent functional implementation (too many irons in the fire, not enough spoons to even keep track of them much less do something useful with them). -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

It just occurred to me that about half the list thinks I just reinvented parser combinators. Not exactly; the thing that distinguishes regexes in particular, and the one whose implications I can't quite wrap my brain around, is that the "many" combinator is actually
many' = reverse . many
But is it really that simple, and is an implementation with decent performance and space characteristics really that simple? Or does something have to be designed from scratch around longest match semantics to get decent behavior? (There's an analogy to folds here. Using this in place of Parsec would have terrible space leaks....) -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms
participants (2)
-
Brandon Allbery
-
C K Kashyap