Regular Expressions without GADTs

Conor McBride wrote:
Inspired by Ralf's post, I thought I'd just GADTize a dependently typed program I wrote in 2001.
Equally inspired, I thought of deGADTizing that code. The code below also uses no existentials, and no local type annotations. The code is more general in that the parser works in an arbitrary MonadPlus. If that happens to be the List monad, we can see alternative parses. The tests illustrate alternative parses and parsing of a list of integers, for example. The code seems a bit simpler, too. {-# OPTIONS -fglasgow-exts #-} module RX where import Control.Monad -- If monadPlus is List, we can get alternative parses class RegExp t tok a | t tok -> a where parse :: MonadPlus m => t -> [tok] -> m a data Zero = Zero data One = One newtype Check tok = Check (tok -> Bool) data Plus r1 r2 = Plus r1 r2 data Mult r1 r2 = Mult r1 r2 data Push tok r = Push tok r newtype Star r = Star r data Void = Void -- Parser Zero fails everywhere instance RegExp Zero tok Void where parse _ _ = mzero -- Parser One recognizes the empty string instance RegExp One tok () where parse _ [] = return () parse _ _ = mzero -- Parser (Check p) recognizes a single-element stream [tok] -- provided tok satisfies the predicate p instance RegExp (Check tok) tok tok where parse (Check p) [t] | p t = return t parse _ _ = mzero -- Parser (Push tok r) prepends tok to the stream and gets parser -- r to recognize the result instance RegExp r tok a => RegExp (Push tok r) tok a where parse (Push tok r) ts = parse r (tok:ts) -- Non-deterministic choice of parsers r1 and r2, applied to the same stream instance (RegExp r1 tok a, RegExp r2 tok b) => RegExp (Plus r1 r2) tok (Either a b) where parse (Plus r1 r2) ts = (liftM Left $ parse r1 ts) `mplus` (liftM Right $ parse r2 ts) -- Mult r1 r2: A sequence of parsers r1 and r2. -- Note the order: we search for the longest prefix of the input stream -- that satisfies the parser r1 instance (RegExp r1 tok a, RegExp r2 tok b) => RegExp (Mult r1 r2) tok (a, b) where parse (Mult r1 r2) ts@[] = liftM2 (,) (parse r1 ts) (parse r2 ts) parse (Mult r1 r2) (t:ts) = parse (Mult (Push t r1) r2) ts `mplus` liftM2 (,) (parse r1 ([] `asTypeOf` ts)) (parse r2 (t:ts)) -- Kleene's star instance RegExp r tok a => RegExp (Star r) tok [a] where parse (Star r) [] = return [] parse (Star r) ts = parse (Mult r (Star r)) ts >>= (\ (x,xs) -> return $ x:xs) p1 = (Mult (Star (Check (== 'a'))) (Star (Check (== 'b')))) asMayBe :: Maybe a -> Maybe a asMayBe = id asList :: [a] -> [a] asList = id testp = asMayBe $ parse (Star (Mult (Star (Check (== 'a'))) (Star (Check (== 'b'))))) "abaabaaabbbb" -- *RX> testp -- Just [("a","b"),("aa","b"),("aaa","bbbb")] -- see alternative parses testp1 = take 3 $ asList $ parse (Star (Mult (Star (Check (== 'a'))) (Star (Check (== 'b'))))) "abaabaaabbbb" -- Parsing the list of integers ieven = even :: Int->Bool iodd = odd :: Int->Bool p2 = Plus (Mult (Check iodd) (Mult (Check iodd) (Star (Check ieven)))) (Mult (Check ieven) (Star (Check iodd))) -- the parser is ambiguous. We can see the alternatives test2 = take 3 $ asList $ parse (Star p2) [1::Int,1,2,3,3,4]

The code seems a bit simpler, too.
Do you really think so? To me replacing a GADT by class and instance declarations seems the wrong way round. We should not forget that the DT in GADT stands for `data type'. One could certainly argue that the gist of functional programming is to define a collection of data types and functions that operate on these types. The fact that with GADTs constructors can have more general types just allows us to use the functional design pattern more often (freeing us from the need or temptation to resort to type class hackery with multiple parameter type classes, undecidable instances, functional dependencies etc). Cheers, Ralf

GADTs have the problem that they're not extensible, but type classes are! Imagine if we add regular hedges. So, I think Oleg has a point here. Oleg's solution may look rather cryptic. But if we recall that type class instances describe proof systems, then Oleg's solutions starts looking quite elegant. Note that the GADTs serve the same purpose, i.e. encoding proof rules. An interesting question is whether there's something GADTs can do "better" than type classes and vice versa. Here's an example, many regexp algorithms can be formulated co-inductively. We'll have trouble to encode these algorithms with "standard" type classes, unless we consider some type class extensions (if this sounds interesting check out the "co-induction type class" paper on my home-page). I don't think this will cause trouble for GADTs. Martin Ralf Hinze writes:
The code seems a bit simpler, too.
Do you really think so? To me replacing a GADT by class and instance declarations seems the wrong way round. We should not forget that the DT in GADT stands for `data type'. One could certainly argue that the gist of functional programming is to define a collection of data types and functions that operate on these types. The fact that with GADTs constructors can have more general types just allows us to use the functional design pattern more often (freeing us from the need or temptation to resort to type class hackery with multiple parameter type classes, undecidable instances, functional dependencies etc).
Cheers, Ralf _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Ralf Hinze wrote:
To me replacing a GADT by class and instance declarations seems the wrong way round. We should not forget that the DT in GADT stands for `data type'.
Francois Pottier enumerated some problems with type inference of GADT code during his ICFP'05 invited talk. Various extensions turn out necessary, such as local inference, shape inference, ad hoc elaboration. GADT implementation in GHC 6.4 uses a different extension, wobbly types. However well these new type system extensions may work on their own, they must co-exist with the other features of the language, in particular type classes. The lifting of type class constraints out of the clauses that pattern-match GADT might no longer be a good idea: legitimate code can no longer type check because the combined constraints are contradictory. One may think that GADTs subsume the regular algebraic data types (ADT), but they do not: irrefutable pattern matching is not possible with GADT in their current implementation. Otherwise, unsoundness ensues. One may feel that type classes are more mature and safer.
One could certainly argue that the gist of functional programming is to define a collection of data types and functions that operate on these types. The fact that with GADTs constructors can have more general types just allows us to use the functional design pattern more often
It seems that type classes offer just the same programming pattern, only more type-explicit. Indeed, we can compare code that pattern-match over GADT
empty :: RegExp tok a -> Maybe a empty Zero = mzero empty (Plus r1 r2) = (return Left `ap` empty r1) `mplus` (return Right `ap` empty r2)
with the corresponding type-class based code:
instance RegExp Zero tok Empty where empty _ _ = mzero instance (RegExp r1 tok a, RegExp r2 tok b) => RegExp (Plus r1 r2) tok (Either a b) where empty (Plus r1 r2) t = (liftM Left $ empty r1 t) `mplus` (liftM Right $ empty r2 t)
which does essentially the same pattern-match, and which also associates constraints with various alternatives. Because typeclass-based pattern-matching is `type explicit', we get a better feedback from the typechecker. For example, `type-pattern-match' failure becomes a type error rather than a run-time error. The typeclass-based `pattern matching' is also more expressive, because we can match on higher-order types and the patterns may be non-linear. Regular-expression `patterns' implemented with the help of type classes have more explicit types, e.g., *RX> :t p p :: Star (Mult (Star (Check Char)) (Star (Check Char))) We can easily extend these types with Digit, Alphabetic, etc. I'm glad Martin Sulzmann mentioned extensibility. The day after the generic value parser code was posted, Huong Nguyen asked to extend it for his particular data type:
data HType = C1 { x :: [String] , y :: HType} | C2 {x1 :: String} | C3 {y1 :: Bool} deriving Show
The extension was quite straightforward and required no modifications to the previously written code:
instance Type HType where parse = parseHType instance Type Bool where parse = reads
parseHType = readParen True (C1 <$$> (parseList (unStr <$> parse),parse)) <+> C2 <$> (unStr <$> parse) <+> C3 <$> parse
(f <$$> (p1,p2)) s = [ (f a b, t') | (a, t) <- p1 s, (b,t') <- p2 t ]
Ralf Hinze wrote:
freeing us from the need or temptation to resort to type class hackery with multiple parameter type classes, undecidable instances, functional dependencies etc.
Incidentally none of the parsing code posted recently used undecidable instances let alone functional dependencies. Some of the pieces of code under discussion was Haskell98. One should note that P. J. Stuckey and Martin Sulzmann's paper ``A theory of overloading'' (the journal version) has formalized multi-parameter type classes with functional dependencies and even overlapping instances, and developed their meta-theory. The paper proved various useful properties such as coherence. One can't help but recall the saying by John Reynolds that the pointer XORing trick is no longer a disgusting hack because he had a clean proof of correctness for it. Multi-parameter type classes with functional dependencies and overlapping instances are not a hack!! Conor McBride wrote:
So what's the point? GADTs are a very convenient way to provide data which describe types. Their encoding via type classes, or whatever, may be possible, but this encoding is not necessarily as convenient as the GADT.
First of all, the precise relationship between GADTs and the typeclasses may be a legitimate topic. Do GADTs bring new expressive power or mere convenience? Another interesting question concerns type-checking: if GADTs are so closely related with type classes, there may exist a deep relationship between typeclass overloading resolution (a quite mature topic) and wobbly types, shape inference, etc. I never argued about convenience of GADTs. They can be quite handy when dealing with existentials: GADT embody a safe cast and so spare us form writing the boring casting code ourselves. And perhaps this is the only compelling case for GADTs.

On 10/19/05, oleg@pobox.com
I never argued about convenience of GADTs. They can be quite handy when dealing with existentials: GADT embody a safe cast and so spare us form writing the boring casting code ourselves. And perhaps this is the only compelling case for GADTs.
Speaking about casts, I was playing with using GADTs to create a non-extensible version of Data.Typeable and Data.Dynamic. I wonder if it's possible to write such a thing without GADTs (and unsafeCoerce, which is used in Data.Dynamic, IIRC). BTW, being non-extensible has some benefits, for example, I feel a bit uneasy when I use full blown Dynamics in Haskell (not that I do it that often). See code in the attachment. It has some functions which I didn't find in Data.Dynamics, actually one function: withDyn :: Dyn -> (forall a. Typed a => a -> b) -> b Best regards Tomasz

On Wed, Oct 26, 2005 at 01:37:29PM +0200, Tomasz Zielonka wrote:
On 10/19/05, oleg@pobox.com
wrote: I never argued about convenience of GADTs. They can be quite handy when dealing with existentials: GADT embody a safe cast and so spare us form writing the boring casting code ourselves. And perhaps this is the only compelling case for GADTs.
Speaking about casts, I was playing with using GADTs to create a non-extensible version of Data.Typeable and Data.Dynamic. I wonder if it's possible to write such a thing without GADTs (and unsafeCoerce, which is used in Data.Dynamic, IIRC).
BTW, being non-extensible has some benefits, for example, I feel a bit uneasy when I use full blown Dynamics in Haskell (not that I do it that often).
See code in the attachment. It has some functions which I didn't find in Data.Dynamics, actually one function: withDyn :: Dyn -> (forall a. Typed a => a -> b) -> b
perhaps something like class Typeable a => MyTypeable a newtype MyDynamic = MyDynamic Dynamic instnance MyTypeable Int instance MyTypeable Foo (and add whatever appropriate access methods you like) and then not export the MyTypeable class so new instances could not be declared. John -- John Meacham - ⑆repetae.net⑆john⑈

Hello, You can also write this code in Haskell 98. Jeremy Gibbons and I have recently written a paper entitled "TypeCase: A design pattern for type-indexed functions" where we explore the use of some techniques from lightweight approaches to generic programming (Ralf Hinze and James Cheney's work) for other purposes. If you are interested you can take a look at: http://web.comlab.ox.ac.uk/oucl/work/bruno.oliveira/typecase.pdf Basically, we can use type classes and GADTs as two alternatives approaches to implement type-based case analysis. A well-known example of this is printf. With printf we want to construct a format string that determines the type of printf. Something similar happens here, we want to build a parser based on the type of the regular expression. There is nothing ingenious with this solution --- I basically translated Oleg's and Conor's code. In the paper we discuss in details different trade-offs between different implementations (e.g. GADT vs type classes). For instance, here is something you cannot do (easily) with the GADT approach:
myParser :: MonadPlus m => Parse m tok Int myParser = Parse (\_ -> return (1::Int))
test = asList (parse myParser "ole")
Basically, with the code that follows you can define your own customized Regular expressions that you can use with your parser. (Sorry for the stupid example, but just to give you the idea). Ultimately you can reason on your programs (of this kind!) using GADTs or multiple-parameter type classes with functional dependencies and them translate them into Haskell 98. Here is the code:
module RegExps where
import Monad
newtype Zero = Zero Zero -- Zero type in Haskell 98
class RegExp g where zero :: g tok Zero one :: g tok () check :: (tok -> Bool) -> g tok tok plus :: g tok a -> g tok b -> g tok (Either a b) mult :: g tok a -> g tok b -> g tok (a,b) push :: tok -> g tok r -> g tok r star :: g tok a -> g tok [a]
newtype Parse m tok t = Parse {parse :: [tok] -> m t}
instance MonadPlus m => RegExp (Parse m) where zero = Parse (\_ -> mzero) one = Parse (\l -> case l of [] -> return () _ -> mzero) check p = Parse (\l -> case l of [t] -> if (p t) then return t else mzero _ -> mzero) plus r1 r2 = Parse (\l -> (liftM Left $ parse r1 l) `mplus` (liftM Right $ parse r2 l)) push tok r = Parse (\l -> parse r (tok:l)) mult r1 r2 = Parse (\l -> case l of [] -> liftM2 (,) (parse r1 l) (parse r2 l) (t:ts) -> parse (mult (push t r1) r2) ts `mplus` liftM2 (,) (parse r1 ([] `asTypeOf` ts)) (parse r2 (t:ts))) star r = Parse (\l -> case l of [] -> return [] ts -> parse (mult r (star r)) ts >>= (\(x,xs) -> return $ x : xs))
Problem with the monomorphism restriction
p1 :: RegExp g => g Char ([Char], [Char]) p1 = mult (star (check (== 'a'))) (star (check (== 'b')))
p1 = (Mult (Star (Check (== 'a'))) (Star (Check (== 'b'))))
asMayBe :: Maybe a -> Maybe a asMayBe = id
asList :: [a] -> [a] asList = id
testp = asMayBe $ parse (star (mult (star (check (== 'a'))) (star (check (== 'b'))))) "abaabaaabbbb"
*RX> testp Just [("a","b"),("aa","b"),("aaa","bbbb")] -- see alternative parses
testp1 = take 3 $ asList $ parse (star (mult (star (check (== 'a'))) (star (check (== 'b'))))) "abaabaaabbbb"
-- Parsing the list of integers
ieven = even :: Int->Bool iodd = odd :: Int->Bool
p2 :: RegExp g => g Int (Either (Int, (Int, [Int])) (Int, [Int])) p2 = plus (mult (check iodd) (mult (check iodd) (star (check ieven)))) (mult (check ieven) (star (check iodd)))
-- the parser is ambiguous. We can see the alternatives
test2 = take 3 $ asList $ parse (star p2) [1::Int,1,2,3,3,4]
Connor's code for empty.
{- newtype Empty tok t = Empty {empty :: Maybe t}
instance RegExp Empty where zero = Empty mzero one = Empty (return ()) check _ = Empty mzero plus r1 r2 = Empty ((return Left `ap` empty r1) `mplus` (return Right `ap` empty r2)) mult r1 r2 = Empty (return (,) `ap` empty r1 `ap` empty r2) star _ = Empty (return []) -}

Hi folks
There is nothing ingenious with this solution --- I basically translated Oleg's and Conor's code.
Neither Oleg nor Bruno translated my code; they threw away my structurally recursive on-the-fly automaton and wrote combinator parsers instead. That's why there's no existential, etc. The suggestion that removing the GADT simplifies the code would be better substantiated if like was compared with like. By the way, the combinator parsers, as presented, don't quite work: try asMayBe $ parse (Star One) "I really must get to the bottom of this..." with Oleg's code, or the analogous (star one) with Bruno's. But it's almost as easy a mistake to fix (I hope) as it is to make, so the point that the GADT isn't strictly necessary does stand. I wouldn't swear my code's correct either, but I don't think it loops on finite input. Oleg simulates GADTs with Ye Olde Type Class Trick, getting around the fact that type classes aren't closed by adding each necessary piece of functionality to the methods. By contrast, Bruno goes straight for the Church encoding, as conveniently packaged by type classes: even if you can't write down the (morally) *initial* RegExp algebra, you can write down what RegExp algebras are in general. These two approaches aren't mutually exclusive: you can use Bruno's algebras to capture the canonical functionality which should be supported for all types in Oleg's class. Then you're pretty much writing down that a GADT is an inductive relation, which perhaps you had guessed already... Where we came in, if you recall, was the use of a datatype to define regular expressions. I used a GADT to equip this type with some useful semantic information (namely a type of parsed objects), and I wrote a program (divide) which exploited this representation to compute the regexp the tail of the input must match from the regexp the whole input must match. This may be a bit of a weird way to write a parser, but it's a useful sort of thing to be able to do. There are plenty of other examples of 'auxiliary' types computed from types to which they relate in some interesting way, the Zipper being the a favourite. It's useful to have a syntax (Generic Haskell, GADTs etc) which makes that look as straightforward and concrete as possible. I'm sure the program I actually wrote can be expressed with the type class trick, just by cutting up my functions and pasting the bits into individual instances; the use of the existential is still available. I don't immediately see how to code this up in Bruno's style, but that doesn't mean it can't be done. Still, it might be worth comparing like with like. I suspect that once you start producing values with the relevant properties (as I do) rather than just consuming them (as Oleg and Bruno do), the GADT method might work out a little neater. And that's the point: not only does the GADT have all the disadvantages of a closed datatype, it has all the advantages too! It's easy to write new functions which both consume and produce data, without having to go back and change the definition. I didn't have to define Division or divide in order to say what a regular expression was. If I remember rightly, regular languages are closed under a whole bunch of useful stuff (intersection, complementation, ...) so someone who can remember better than me might implement the corresponding algorithms together with their related operations; they don't need to change the definition of RegExp to do that. Might we actually decide semantic subtyping (yielding a coercion) that way? So what's the point? GADTs are a very convenient way to provide data which describe types. Their encoding via type classes, or whatever, may be possible, but this encoding is not necessarily as convenient as the GADT. Whether or not the various encodings of GADTs deliver more convenience in this example is not clear, because the programs do not correspond. Moreover, once you have to start messing about with equality constraints, the coding to eliminate GADTs becomes a lot hairier. I implemented it back in '95 (when all these things had different names) and I was very glad not to have to do it by hand any more. But your mileage may vary Conor

Hello Conor, I personally think GADTs are a great idea, don't get me wrong. I am not arguing that we should all start programming using on a Church encoding. Neither I am arguing that this encoding can replace GADTs in all cases. Nevertheless, I believe that knowing about this encoding, in the first place, can be extremely valuable for programmers. Let's not forget that neither GADTs or multiple parameter type classes are Haskell 98! In fact, as far as I understood from the Haskell Workshop, we are a long way from those extensions becoming standard. My point is that if you are developing software, then you want to use something standard (this is, running on Hugs, Nhc, Ghc, Ehc, ...). This encoding tries to avoid extensions (not saying that it does for all cases). In the end, that is what most design patterns are: workarounds on missing language features. In this case missing features of Haskell 98 (but not GHC Haskell). Best Regards, Bruno Oliveira

On Tue, Oct 18, 2005 at 05:51:13PM +0100, Conor McBride wrote: | Neither Oleg nor Bruno translated my code; they threw away my | structurally recursive on-the-fly automaton and wrote combinator parsers | instead. That's why there's no existential, etc. The suggestion that | removing the GADT simplifies the code would be better substantiated if | like was compared with like. Here's a version in H98 + existentials. I'm not claiming it proves anything.
module RegExp where
import Control.Monad
data RegExp tok a = Zero | One a | Check (tok -> a) (tok -> Bool) | Plus (RegExp tok a) (RegExp tok a) | forall b. Mult (RegExp tok (b -> a)) (RegExp tok b) | forall e. Star ([e] -> a) (RegExp tok e)
I could just use Prod (One f) instead of fmap f, but the functor instance is mechanical:
instance Functor (RegExp tok) where fmap f Zero = Zero fmap f (One v) = One (f v) fmap f (Check k p) = Check (f.k) p fmap f (Plus r1 r2) = Plus (fmap f r1) (fmap f r2) fmap f (Mult r1 r2) = Mult (fmap (f.) r1) r2 fmap f (Star k r) = Star (f.k) r
Constructors
one :: RegExp tok () one = One ()
check :: (tok -> Bool) -> RegExp tok tok check = Check id
plus :: RegExp tok a -> RegExp tok b -> RegExp tok (Either a b) plus r1 r2 = Plus (fmap Left r1) (fmap Right r2)
mult :: RegExp tok a -> RegExp tok b -> RegExp tok (a,b) mult r1 r2 = Mult (fmap (,) r1) r2
star :: RegExp tok a -> RegExp tok [a] star = Star id
Parsing
parse :: RegExp tok a -> [tok] -> Maybe a parse r [] = empty r parse r (t : ts) = parse (divide t r) ts
empty :: RegExp tok a -> Maybe a empty Zero = mzero empty (One v) = return v empty (Check _ _) = mzero empty (Plus r1 r2) = empty r1 `mplus` empty r2 empty (Mult r1 r2) = empty r1 `ap` empty r2 empty (Star k _) = return (k [])
divide :: tok -> RegExp tok a -> RegExp tok a divide t Zero = Zero divide t (One v) = Zero divide t (Check k p) | p t = One (k t) | otherwise = Zero divide t (Plus r1 r2) = Plus (divide t r1) (divide t r2) divide t (Mult r1 r2) = case empty r1 of Nothing -> Mult (divide t r1) r2 Just f -> Plus (Mult (divide t r1) r2) (fmap f (divide t r2)) divide t (Star k r) = Mult (fmap k_cons (divide t r)) (Star id r) where k_cons x xs = k (x:xs)
participants (8)
-
Bruno Oliveira
-
Conor McBride
-
John Meacham
-
Martin Sulzmann
-
oleg@pobox.com
-
Ralf Hinze
-
Ross Paterson
-
Tomasz Zielonka