[Haskell-cafe] Regular Expressions, was Re: Interest in helping w/ Haskell standard

15 Oct 2005

      Hi folks,

Inspired by Ralf's post, I thought I'd just GADTize a dependently typed 
program I wrote in 2001.

Wolfgang Jeltsch wrote:
...
Now lets consider using an algebraic datatype for regexps:
data RegExp
      = Empty | Single Char | RegExp :+: RegExp | RegExp :|: RegExpt | Iter RegExp
Manipulating regular expressions now becomes easy and safe – you are just not 
able to create "syntactically incorrect regular expressions" since during 
runtime you don't deal with syntax at all.
A fancier variation on the same theme...
...
data RegExp :: * -> * -> * where
  Zero   :: RegExp tok Empty
  One    :: RegExp tok ()
  Check  :: (tok -> Bool) -> RegExp tok tok
  Plus   :: RegExp tok a -> RegExp tok b -> RegExp tok (Either a b)
  Mult   :: RegExp tok a -> RegExp tok b -> RegExp tok (a, b)
  Star   :: RegExp tok a -> RegExp tok [a]
...
data Empty
The intuition is that a RegExp tok output is a regular expression 
explaining how to parse a list of tok as an output. Here, Zero is the 
regexp which does not accept anything, One accepts just the empty 
string, Plus is choice and Mult is sequential composition; Check lets 
you decide whether you like a single token.

Regular expressions may be seen as an extended language of polynomials 
with tokens for variables; this parser works by repeated application of 
the remainder theorem.
...
parse :: RegExp tok x -> [tok] -> Maybe x
parse r []       = empty r
parse r (t : ts) = case divide t r of
  Div q f -> return f `ap` parse q ts
Example

*RegExp> parse (Star (Mult (Star (Check (== 'a'))) (Star (Check (== 
'b'))))) "abaabaaabbbb"
Just [("a","b"),("aa","b"),("aaa","bbbb")]

The 'remainder' explains if a regular expression accepts the empty 
string, and if so, how. The Star case is a convenient 
underapproximation, ruling out repeated empty values.
...
empty :: RegExp tok a -> Maybe a
empty Zero         = mzero
empty One          = return ()
empty (Check _)    = mzero
empty (Plus r1 r2) = (return Left  `ap` empty r1) `mplus`
                     (return Right `ap` empty r2)
empty (Mult r1 r2) = return (,) `ap` empty r1 `ap` empty r2
empty (Star _)     = return []
The 'quotient' explains how to parse the tail of the list, and how to 
recover the meaning of the whole list from the meaning of the tail.
...
data Division tok x = forall y. Div (RegExp tok y) (y -> x)
Here's how it's done. I didn't expect to need scoped type variables, but 
I did...
...
divide :: tok -> RegExp tok x -> Division tok x
divide t Zero                  = Div Zero naughtE
divide t One                   = Div Zero naughtE
divide t (Check p) | p t       = Div One (const t)
                   | otherwise = Div Zero naughtE
divide t (Plus (r1 :: RegExp tok a) (r2 :: RegExp tok b)) =
  case (divide t r1, divide t r2) of
    (Div (q1 :: RegExp tok a') (f1 :: a' -> a),
      Div (q2 :: RegExp tok b') (f2 :: b' -> b)) ->
      Div (Plus q1 q2) (f1 +++ f2)
divide t (Mult r1 r2) = case (empty r1, divide t r1, divide t r2) of
  (Nothing, Div q1 f1, _) -> Div (Mult q1 r2) (f1 *** id)
  (Just x1, Div q1 f1, Div q2 f2) ->
    Div (Plus (Mult q1 r2) q2) (either (f1 *** id) (((,) x1) . f2))
divide t (Star r) = case (divide t r) of
  Div q f -> Div (Mult q (Star r)) (\ (y, xs) -> (f y : xs))
Bureaucracy.
...
(***) :: (a -> b) -> (c -> d) -> (a, c) -> (b, d)
(f *** g) (a, c) = (f a, g c)
...
(+++) :: (a -> b) -> (c -> d) -> Either a c -> Either b d
(f +++ g) (Left a)  = Left  (f a)
(f +++ g) (Right c) = Right (g c)
...
naughtE :: Empty -> x
naughtE = undefined
It's not the most efficient parser in the world (doing some algebraic 
simplification on the fly wouldn't hurt), but it shows the sort of stuff 
you can do.

Have fun

Conor

[Haskell-cafe] Regular Expressions, was Re: Interest in helping w/ Haskell standard

Conor McBride