Re: [Haskell-cafe] Regular Expressions without GADTs

Conor McBride wrote:
Neither Oleg nor Bruno translated my code; they threw away my structurally recursive on-the-fly automaton and wrote combinator parsers instead. That's why there's no existential, etc. The suggestion that removing the GADT simplifies the code would be better substantiated if like was compared with like. ...
I'm sure the program I actually wrote can be expressed with the type class trick, just by cutting up my functions and pasting the bits into individual instances; the use of the existential is still available. I don't immediately see how to code this up in Bruno's style, but that doesn't mean it can't be done. Still, it might be worth comparing like with like.
Please see the enclosed code. It is still in Haskell98 -- and works in Hugs.
I suspect that once you start producing values with the relevant properties (as I do) rather than just consuming them (as Oleg and Bruno do), the GADT method might work out a little neater.
Actually, the code is pretty much your original code, with downcased identifiers. It faithfully implements that parser division approach. Still, there are no existentials. I wouldn't say that GADT code is so much different. Perhaps the code below is a bit neater due to the absence of existentials, `case' statements, and local type annotations. {-- Haskell98! --} module RegExps where import Monad newtype Zero = Zero Zero -- Zero type in Haskell 98 -- Bruno.Oliveira's type class class RegExp g where zero :: g tok Zero one :: g tok () check :: (tok -> Bool) -> g tok tok plus :: g tok a -> g tok b -> g tok (Either a b) mult :: g tok a -> g tok b -> g tok (a,b) star :: g tok a -> g tok [a] data Parse tok t = Parse { empty :: Maybe t , divide :: tok -> Parse tok t} parse :: Parse tok a -> [tok] -> Maybe a parse p [] = empty p parse p (t:ts) = parse (divide p t) ts liftP f p = Parse{empty = liftM f (empty p), divide = \tok -> liftP f (divide p tok)} liftP2 mf p1 p2 = Parse{empty = mf (empty p1) (empty p2), divide = \tok -> liftP2 mf (divide p1 tok) (divide p2 tok)} lsum x y = (liftM Left x) `mplus` (liftM Right y) lprod x y = liftM2 (,) x y -- Conor McBride's parser division algorithm instance RegExp Parse where zero = Parse mzero (\_ -> zero) one = Parse (return ()) (\_ -> liftP (const ()) zero) check p = Parse mzero (\t -> if p t then liftP (const t) one else liftP (const t) zero) plus r1 r2 = Parse (lsum (empty r1) (empty r2)) (\t -> plus (divide r1 t) (divide r2 t)) mult r1 r2 = Parse (lprod (empty r1) (empty r2)) (\t -> let (q1,q2) = (divide r1 t, divide r2 t) lp x1 = liftP (either id ((,) x1)) in maybe (mult q1 r2) (\x1 -> lp x1 (plus (mult q1 r2) q2)) (empty r1)) star r = Parse (return []) (\t-> liftP (uncurry (:)) (mult (divide r t) (star r))) p1 :: RegExp g => g Char ([Char], [Char]) p1 = mult (star (check (== 'a'))) (star (check (== 'b'))) testp = parse (star (mult (star (check (== 'a'))) (star (check (== 'b'))))) "abaabaaabbbb" {- *RX> testp Just [("a","b"),("aa","b"),("aaa","bbbb")] -} testc = parse (star one) "abracadabra" -- Parsing the list of integers ieven = even :: Int->Bool iodd = odd :: Int->Bool p2 :: RegExp g => g Int (Either (Int, (Int, [Int])) (Int, [Int])) p2 = plus (mult (check iodd) (mult (check iodd) (star (check ieven)))) (mult (check ieven) (star (check iodd))) test2 = parse (star p2) [1::Int,1,2,3,3,4]

oleg@pobox.com wrote:
I suspect that once you start producing values with the relevant properties (as I do) rather than just consuming them (as Oleg and Bruno do), the GADT method might work out a little neater.
Actually, the code is pretty much your original code, with downcased identifiers. It faithfully implements that parser division approach. Still, there are no existentials.
That's because this isn't quite my program either, although I'll grant you it's a lot closer. Give me back the uppercase identifiers and you've pretty much done it. That's to say, I want a datatype of regular expressions which happens (at least) to support parsing, rather than a parser library which supports (at least) regular languages. We can leave it to others to decide which is neater, but I'd remark that...
I wouldn't say that GADT code is so much different. Perhaps the code below is a bit neater due to the absence of existentials, `case' statements, and local type annotations.
...I'm also irritated by the case statements, the local type annotations, and so on. I was rather surprised to find that I needed them. I bet it's not an essential or permanent problem, and it's certainly a problem we have planned not to have in Epigram. The existential also bothers me, but for different reasons. In the dependently typed version, RegExp isn't a GADT: it's just an ordinary datatype, and there's a perfectly ordinary function, Sem :: RegExp -> *, which computes the type corresponding to a regular expression. The syntactic part of division is thus completely separated from the semantic part (glueing the head back on). The GADT version here replaces Sem by an index in the datatype, so sometimes I'm forced to write an existential in order to collect what would have been the result of applying Sem. You've avoided the existential by throwing away the syntax and keeping only the specific empty-and-divide semantics required for this particular example. In effect * I define regular expressions and show they possess at least the empty-and-divide semantics; * You define the empty-and-divide semantics and show that it's closed under at least the regular expression formers (nicely expressed via Bruno's class) Let's see... Here's the class of regular expression algebras:
-- Bruno.Oliveira's type class class RegExp g where zero :: g tok Zero one :: g tok () check :: (tok -> Bool) -> g tok tok plus :: g tok a -> g tok b -> g tok (Either a b) mult :: g tok a -> g tok b -> g tok (a,b) star :: g tok a -> g tok [a]
Here's the specification of the intended semantics:
data Parse tok t = Parse { empty :: Maybe t , divide :: tok -> Parse tok t}
Here's where you explain how stuff other than regular expressions also has the semantics. This is how the existential is made to disappear: you use liftP to fuse the head-glueing function into the parsers themselves. It's something like adding Map :: (s -> t) -> RegExp tok s -> RegExp tok t to the syntax of regular expressions, isn't it?
liftP f p = Parse{empty = liftM f (empty p), divide = \tok -> liftP f (divide p tok)}
And then you show that indeed the desired semantics is closed under RegExp formation.
instance RegExp Parse where
It's a classic instance of the Expression Problem. If you fix the elimination behaviour, it's easy to add new introduction behaviour; if you fix the introduction behaviour, it's easy to add new elimination behaviour. I think it's important to support both modes of working: type classes are good at the former, datatypes the latter, so let's be open to both. There's always a danger with small examples that there's not so much to choose between the two approaches---except that they scale up in completely different ways. It's important not to get into a sterile argument 'Oh, but I can extend the grammar', 'Oh, but I can extend the functionality', as if only one of these is ever important. I will happily cheer anyone who can help us to be lighter on our feet in jumping between the two. I'm not saying there's no value to 'poor man's' alternatives to GADTs. I certainly agree with Bruno (a) that it's important to know how to work within the standard; my experiences implementing the first version of Epigram have convinced me to be much more cautious about which Haskell extensions I'm willing to use in the second (b) that the pattern 'show type constructor f has such-an-such an algebra' is useful and worth abstracting. I was just about to send, when up popped Ross's program, which does give me a datatype, including just enough extra stuff to support functoriality.
data RegExp tok a = Zero | One a | Check (tok -> a) (tok -> Bool) | Plus (RegExp tok a) (RegExp tok a) | forall b. Mult (RegExp tok (b -> a)) (RegExp tok b) | forall e. Star ([e] -> a) (RegExp tok e)
This combines several dodges rather neatly. You can see the traces of the translation from GADT to nested-datatype-with-equations, but instead of saying forall e. Star (Eq [e] a) (RegExp tok e), Ross has kept just the half of the iso he needs in order to extract the parsed value. Throwing away the other half more-or-less allows him to hide the head-glueing functions inside the grammar of the regular expressions. In effect, Map has been distributed through the syntax. It's nice that the syntax of regular expressions survives, but it has been somewhat spun in favour of the functionality required in the example. Of the fakes so far, I like this the best, because it is *data*. I would say that the availability of 'this workaround for this example, that workaround for that example', is evidence in *favour* of adopting the general tools which are designed to support many examples. If a job's worth doing, it's worth doing well. Is there a *uniform* workaround which delivers all the GADT functionality cheaply and cleanly? Kind of subjective, I suppose, but I haven't seen anything cheap enough for me. All the best Conor

On Wed, Oct 19, 2005 at 03:17:41PM +0100, Conor McBride wrote:
I was just about to send, when up popped Ross's program, which does give me a datatype, including just enough extra stuff to support functoriality.
data RegExp tok a = Zero | One a | Check (tok -> a) (tok -> Bool) | Plus (RegExp tok a) (RegExp tok a) | forall b. Mult (RegExp tok (b -> a)) (RegExp tok b) | forall e. Star ([e] -> a) (RegExp tok e)
This combines several dodges rather neatly. You can see the traces of the translation from GADT to nested-datatype-with-equations, but instead of saying forall e. Star (Eq [e] a) (RegExp tok e), Ross has kept just the half of the iso he needs in order to extract the parsed value.
The direction I've kept is the one that matters if everything is covariant, because forall a. F a -> T (G a) ~= forall a, b. (G a -> b) -> F a -> T b if F, G and T are functors. So we can translate data T :: * -> * where C :: F a -> T (G a) to data T b = forall a. C (G a -> b) (F a) and the mechanical translation of your GADT is data RegExp tok a = Zero (Empty -> a) Empty | One (() -> a) () | Check (tok -> a) (tok -> Bool) (RegExp tok a) | forall b c. Plus (Either b c -> a) (RegExp tok b) (RegExp tok c) | forall b c. Mult ((b, c) -> a) (RegExp tok b) (RegExp tok c) | forall e. Star ([e] -> a) (RegExp tok e) A little simplification (ignoring lifted types here and there) yields the version I gave. And of course we're no longer assuming that the function is half of an iso.
Throwing away the other half more-or-less allows him to hide the head-glueing functions inside the grammar of the regular expressions. In effect, Map has been distributed through the syntax.

On Wed, Oct 19, 2005 at 03:17:41PM +0100, Conor McBride wrote:
It's nice that the syntax of regular expressions survives, but it has been somewhat spun in favour of the functionality required in the example.
I'm not sure about that either -- the existential coding supports elimination too, if the result type is functorial (again specializing the general scheme): caseRegExp :: Functor r => r Empty -> r () -> ((tok -> Bool) -> r tok) -> (forall a b. RegExp tok a -> RegExp tok b -> r (Either a b)) -> (forall a b. RegExp tok a -> RegExp tok b -> r (a, b)) -> (forall a. RegExp tok a -> r [a]) -> RegExp tok v -> r v caseRegExp fZero fOne fCheck fPlus fMult fStar x = case x of Zero -> fmap (const bot) fZero One v -> fmap (const v) fOne Check k p -> fmap k (fCheck p) Plus r1 r2 -> fmap (either id id) (fPlus r1 r2) Mult r1 r2 -> fmap (uncurry id) (fMult r1 r2) Star k r -> fmap k (fStar r) where bot = error "can't happen" Not that this is comfortable to use, but it does show that existentials are as expressive, assuming the functor: they can't do RegExp t a -> a -> a.
I would say that the availability of 'this workaround for this example, that workaround for that example', is evidence in *favour* of adopting the general tools which are designed to support many examples. If a job's worth doing, it's worth doing well. Is there a *uniform* workaround which delivers all the GADT functionality cheaply and cleanly? Kind of subjective, I suppose, but I haven't seen anything cheap enough for me.
It seems the existential coding is uniform, if everything is covariant.

Hi Ross Paterson wrote:
On Wed, Oct 19, 2005 at 03:17:41PM +0100, Conor McBride wrote:
It's nice that the syntax of regular expressions survives, but it has been somewhat spun in favour of the functionality required in the example.
I'm not sure about that either -- the existential coding supports elimination too, if the result type is functorial (again specializing the general scheme):
So, you certainly use the weakening of the equality constraint to cache semantic information inside the datatype of regular expressions. Correspondingly, you acquire this need for a functorial result type in the elimination scheme. That's a big ask: normally the result type need only respect equality, which is trivial, morally at least.
caseRegExp :: Functor r => r Empty -> r () -> ((tok -> Bool) -> r tok) -> (forall a b. RegExp tok a -> RegExp tok b -> r (Either a b)) -> (forall a b. RegExp tok a -> RegExp tok b -> r (a, b)) -> (forall a. RegExp tok a -> r [a]) -> RegExp tok v -> r v caseRegExp fZero fOne fCheck fPlus fMult fStar x = case x of Zero -> fmap (const bot) fZero One v -> fmap (const v) fOne Check k p -> fmap k (fCheck p) Plus r1 r2 -> fmap (either id id) (fPlus r1 r2) Mult r1 r2 -> fmap (uncurry id) (fMult r1 r2) Star k r -> fmap k (fStar r) where bot = error "can't happen"
Not that this is comfortable to use, but it does show that existentials are as expressive, assuming the functor: they can't do RegExp t a -> a -> a.
Indeed: you get what you pay for. By the way, if you set things up properly, substituting equal for equal should have no run-time overhead; fmap may well cost...
I would say that the availability of 'this workaround for this example, that workaround for that example', is evidence in *favour* of adopting the general tools which are designed to support many examples. If a job's worth doing, it's worth doing well. Is there a *uniform* workaround which delivers all the GADT functionality cheaply and cleanly? Kind of subjective, I suppose, but I haven't seen anything cheap enough for me.
It seems the existential coding is uniform, if everything is covariant.
Yes, the availability of a uniform coding wasn't in doubt; it's a question of convenience. I'm inclined to agree that things aren't too bad here *yet*, especially if you code up the elimination scheme with the help of Bruno's type class. Where the wheel really starts to wobble is when you need to eliminate a more specific instance of the family. How do you write a function in RegExp tok (Either Empty x) -> MyReturn x? Just eliminating directly leaves you with the problem of getting rid of the non-Plus cases and refining the Plus case. You end up needing a return type loaded with 'Henry Ford' equations newtype LoadedReturn a = MkLoadedReturn (forall x. Eq a (Either Empty x) -> MyReturn x) for some suitable definition of Eq (like (->)?). Never mind the non-Plus cases, the Plus case gives you some RegExp tok a, some RegExp tok b, and a handy fact, viz Eq (Either a b) (Either Empty x), from which you need a MyReturn x. Now the hard work starts. I'm not saying that it's impossible, nor that it's always this hard. I'm just saying that it's suffering unnecessarily to crank this machinery by hand. The only point I'm trying to underline here is that, in general, GADTs yield a non-trivial saving over their various encodings, sufficient to be worth tidying up and keeping. Obviously I'm biased, but I think they're a big step in the right direction. Cheers Conor This message has been checked for viruses but the contents of an attachment may still contain software viruses, which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation.
participants (4)
-
Conor McBride
-
Conor McBride
-
oleg@pobox.com
-
Ross Paterson