Design of extremely usable programming language libraries

Dear Cafe, I'm exploring the design space of programming language libraries with enhanced usability and I'd your help and comments. I'll start with a few short questions, but offer a detailed discussion of the motivations and the problems I'm facing below. So, if you have interest in the subject or feel you can offer some insight, please, do read on. * Does any generic traversal/transformation (uniplate-style) library support GADTs? * What is the best choice, performance- and memory-wise, for a parser combinator library with support for arbitrary look-ahead? Parsec is considered slow by some [1], but is it only in comparison with attoparsec (which, unfortunately, doesn't support arbitrary look-ahead)? Is there any parser library that performs better than Parsec while still supporting arbitrary look-ahead. * Any multi-mode pretty printer libraries? By multi-mode I mean writing code once and being able to generate, say, both "pretty" and "minified" text representations of a program by changing just one parameter. Also, what's the most efficient pretty-printing library nowadays? Blaze? I've been using Haskell for quite a while now, primarily, for programming-language applications: program analysis, transformation and compilation. I'm sure many would agree that PL work is where Haskell shines. In the recent years new language features and libraries --namely, GADTs, Template Haskell, quasi-quotation and generic programming--- have appeared that could make working with languages even easier than before. That's why it's sad to see that none of the PL libraries seem to make good use of these features (however, I might be starting to understand why). So, I'm currently exploring the design space for a library that uses these advanced Haskell features for delivering better usability, and I'm having problems with implementing some of them. I welcome comments on both the motivations, overall design and the more technical aspects. I've omitted a few details because it's a long e-mail as is. If something is not clear or doesn't make sense, please, let me know. I'll start by listing the features that an "ideal" PL library should have, and that I've come to cherish as both a heavy user and a developer of such. The basic features (pretty much every library has them) include a parser (text->AST (abstract syntax tree)) and a pretty-printer (AST->text), as well as a Haskell representation of the AST that is somewhat easy to use. Pretty much every library has that --- although some might debate the ease of use of the AST representations. However, there are other features that, in my opinion, are essential to a PL library. The features are motivated by three requirements: static safety (as few run-time errors as possible), minimal code duplication (DRY) and ease of use and inspection of the code. 1) the pretty-printer should be multi-mode. One should be able to write code once and be able to generate different textual representations of the AST: - the "pretty" which is nice to the eye with white spaces, indentations etc. - minified, with minimum white space (while still being valid) - debuggable which inserts comments based on AST annotations - source-map generation - being able to generate colored LaTeX/HTML code would be nice, but non-essential 2) ASTs should be statically safe: you should only be able construct values that represent valid programs, or get a typechecker/compiler error otherwise. Languages that have syntactic productions that can appear in one context but not in another need GADTs with type witnesses to achieve that. In fact, such languages are often used to motivate GADTs in the first place [2]. And while the problem in [2] could have been solved by splitting the Expr datatype into two (IntExpr and BoolExpr), in some languages this can't be done (or produces awkward syntax trees). 3) a quasi-quoter with support for anti-quotation and quoted patterns. This also saves a lot of typing *and* makes your code less error-prone and easier to read. What is better (to both write and read)?
[js|#x# = (function (a, b) {return {t1: a + b, t2: a*b};})(#x#, #y#);|] or ExprStmt def $ AssignExpr def x (CallExpr def (FuncExpr def Nothing [Id def "a", Id def "b"] $ ReturnStmt def $ ObjectLit [(PropId def $ Id def "t1", InfixExpr def OpAdd (VarRef def $ Id def "a") (VarRef def $ Id def "b")), PropId def $ Id def "t2", InfixExpr def OpMul (VarRef def $ Id def "a") (VarRef def $ Id def "b")])) [x, y]
The caveat here is that, to help ensure correctness, the quasi-quoter and the parser should share code as much code as possible. Ideally, there should be just one parser that has a switch for recognizing normal and quasi-quoted programs. However, that would require adding additional constructors representing anti-quotations to our AST. And with that the user might be able to generate invalid AST's and cause a run-time error. The solution to the last problem is to statically constrain all the values that are passed to, say, the pretty-printer so that they are guaranteed to be free of anti-quotes (see an example definition below). However, that, again, requires GADTs (e.g. have all the AST datatypes have an extra type parameter).
data EType = Complete | HasHoles type family Quoted a b :: * canHaveHolesT :: a -> b -> Quoted a b canHaveHolesT _ _ = undefined type instance Quoted HasHoles Complete = HasHoles type instance Quoted Complete HasHoles = HasHoles type instance Quoted HasHoles HasHoles = HasHoles type instance Quoted Complete Complete = HasHoles data Expr t where EInt :: Int -> Expr Complete EAdd :: Expr t1 -> Expr c2 -> Expr (Holes t1 t2) ... EQuote :: String -> Expr HasHoles
And then we could have a normal parser return a value 'Expr Complete' and a quasi-quotation parser retunr a value 'Expr HasHoles'. Similarly, the pretty printer function could have type 'Expr Complete -> Doc'. 4) We should be able to annotate ASTs with arbitrary values, and change the types of those values as we go. The most user friendly way, IMO, is to have the AST datatypes be polymorphic and have that type parameter as an extra field in every constructor. E.g.,
data Expr t a where EInt :: a -> Int -> Expr Complete a EAdd :: a -> Expr t1 a -> Expr c2 a -> Expr (Holes t1 t2) a ... EQuote :: a -> String -> Expr HasHoles a
Then we can use the functions in Traverseable to change types of annotations, and inspect the values by pattern-matching on constructors. 5) support for generic operations on syntax trees. Uniplate, which has been designed to work with ASTs, and is awesome for that purpose because it saves a lot of time. I use transform(Bi) and universe(Bi) all the time and it saves *a lot* of typing. Pretty much all my analysis/transformation code uses those four small-but-powerful function calls -- and, dare I say, it's quite elegant. Other useful, but not crucial features include: 1) diffs for ASTs (in the spirit of the 'gdiff' library, which, alas, doesn't work with polymorphic datatypes) 2) QuickCheck arbitrary instances for ASTs. No technical difficulty there, but writing instances that generate interesting programs and don't run out of memory is quite hard :) I wish 'Agata' was still supported, or there was some library that helps writing Arbitrary instances for ASTs. If you think there's another feature in mind that is missing from the list, please, let me know. The (perceived) challenges in implementing the functionality outlined above are as follows: 1) No multi-mode pretty-printing library. I think that the mutli-mode functionality could be implemented on top of an existing library by definining new combinators, but it would be nice to have a library that supports them out of the box. The particular features that I'm missing are: - "non-essential space/(soft-)line break" combinators that are interpreted as spaces/line-breaks in the "pretty" mode and as empty docs in the "minified" mode. - "comment" combinator which inserts the text in a comment only if the "debug" mode is on - being able to record the positions of AST nodes in the resulting text (for generating source maps). Not sure what would be a convenient interface for that. Note: I know that mainland-pretty has position information, but I don't think it's helpful for generating source maps. 2) The biggest problem is that there are two good reasons to use GADTs when specifying AST datatypes. However, uniplate doesn't work with GADTs and, as far as I know, no currently supported generic programming library does (to be precise, I need support for families of mutually recursive polymorphic GADTs). Am I missing some library, or is my understanding correct? If it's the latter, is there any fundamental limitation that prevents creating such a library? Maybe there are other (but still elegant) ways to satisfy my requirements without using GADTs? 3) 'gdiff' doesn't support polymorphic datatypes. Is there any other library that does? [1] http://www.serpentine.com/blog/2010/03/03/whats-in-a-parsing-library-1/ [2] http://en.wikibooks.org/wiki/Haskell/GADT PS: My attempts so far are in https://github.com/achudnov/language-nextgen/blob/master/Language/Nextgen/Sy... Regards, Andrey Chudnov

* Andrey Chudnov
* Does any generic traversal/transformation (uniplate-style) library support GADTs?
Any syb-style library works with GADTs, by the virtue of dealing with value representations instead of type representations.
* What is the best choice, performance- and memory-wise, for a parser combinator library with support for arbitrary look-ahead? Parsec is considered slow by some [1], but is it only in comparison with attoparsec (which, unfortunately, doesn't support arbitrary look-ahead)? Is there any parser library that performs better than Parsec while still supporting arbitrary look-ahead.
Not sure what you mean here — attoparsec does support unlimited lookahead, in the sense that a parser may fail arbitrarily late in the input stream, and backtrack to any previous state. Although attoparsec is a poor choice for programming language parsing, primarily because of the error messages. Roman

Thanks for a prompt reply, Roman. On 05/28/2013 04:52 PM, Roman Cheplyaka wrote:
Any syb-style library works with GADTs, by the virtue of dealing with value representations instead of type representations. I tried to use syb, but the following code fails to typecheck for me. What am I doing wrong? {-# LANGUAGE GADTs, EmptyDataDecls, MultiParamTypeClasses, TypeFamilies #-} {-# LANGUAGE DeriveDataTypeable, StandaloneDeriving #-}
data HasHoles data Complete deriving instance Typeable HasHoles deriving instance Data HasHoles deriving instance Typeable Complete deriving instance Data Complete type family Holes a b :: * canHaveHolesT :: a -> b -> Holes a b canHaveHolesT _ _ = undefined type instance Holes HasHoles Complete = HasHoles type instance Holes Complete HasHoles = HasHoles type instance Holes HasHoles HasHoles = HasHoles type instance Holes Complete Complete = HasHoles
data Expression k a where EQuote :: a -> String -> Expression HasHoles a IntLit :: a -> Int -> Expression Complete a EArith :: a -> ArithOp -> Expression k1 a -> Expression k2 a -> Expression (Holes k1 k2) a deriving instance Typeable2 (Expression) deriving instance Data (Expression k a) data ArithOp = OpAdd | OpSub | OpMul | OpDiv deriving (Data, Typeable)
Fails with:
Couldn't match type `Complete' with `HasHoles' Expected type: a -> String -> Expression k a Actual type: a -> String -> Expression HasHoles a In the first argument of `z', namely `EQuote' In the first argument of `k', namely `z EQuote' When typechecking the code for `Data.Data.gunfold' in a standalone derived instance for `Data (Expression k a)': To see the code I am typechecking, use -ddump-deriv
Not sure what you mean here — attoparsec does support unlimited lookahead, in the sense that a parser may fail arbitrarily late in the input stream, and backtrack to any previous state. Although attoparsec is a poor choice for programming language parsing, primarily because of the error messages. I guess I have an outdated notion of attoparsec. But yes, error messages seem to be the weak point of attoparsec. Also, the fact that it only accepts bytestrings makes it harder (but no impossible, since we can convert Strings to ByteStrings) to reuse the parser as a QuasiQuoter. So, I'll rephrase my question. What's the best choice for a library for parsing programming languages nowadays?

Not sure what you mean here — attoparsec does support unlimited lookahead, in the sense that a parser may fail arbitrarily late in the input stream, and backtrack to any previous state. Although attoparsec is a poor choice for programming language parsing, primarily because of the error messages. I guess I have an outdated notion of attoparsec. But yes, error messages seem to be the weak point of attoparsec. Also, the fact that it only accepts bytestrings makes it harder (but no impossible, since we can convert Strings to ByteStrings) to reuse the parser as a QuasiQuoter. So, I'll rephrase my question. What's the best choice for a library for parsing programming languages nowadays?
Parsec is still widely popular since it's part of the HP, but I use uu-parsinglib as my first-choice parser. It comes with a lot of examples, good documentation, and many features I like (good error messages and auto error correction). I don't know how performance compares with parsec or attoparsec, but it's always been good enough for me. John L.

Unfortunately you can only do traversals, not unfolds, with GADTs.
That's because in an unfold, the return type is determined by the value
itself and can vary among the produced results, whereas in a traversal
it is determined by the input type.
This means also that you cannot simply derive Data, because the derived
instance will contain a gunfold function, which then will fail to
typecheck.
You can copy-paste the generated instance (-ddump-deriv) and simply
remove the code for gunfold (or write your own deriver). The following
compiles for me:
https://gist.github.com/feuerbach/5668198
Roman
* Andrey Chudnov
Thanks for a prompt reply, Roman.
On 05/28/2013 04:52 PM, Roman Cheplyaka wrote:
Any syb-style library works with GADTs, by the virtue of dealing with value representations instead of type representations. I tried to use syb, but the following code fails to typecheck for me. What am I doing wrong? {-# LANGUAGE GADTs, EmptyDataDecls, MultiParamTypeClasses, TypeFamilies #-} {-# LANGUAGE DeriveDataTypeable, StandaloneDeriving #-}
data HasHoles data Complete deriving instance Typeable HasHoles deriving instance Data HasHoles deriving instance Typeable Complete deriving instance Data Complete type family Holes a b :: * canHaveHolesT :: a -> b -> Holes a b canHaveHolesT _ _ = undefined type instance Holes HasHoles Complete = HasHoles type instance Holes Complete HasHoles = HasHoles type instance Holes HasHoles HasHoles = HasHoles type instance Holes Complete Complete = HasHoles
data Expression k a where EQuote :: a -> String -> Expression HasHoles a IntLit :: a -> Int -> Expression Complete a EArith :: a -> ArithOp -> Expression k1 a -> Expression k2 a -> Expression (Holes k1 k2) a deriving instance Typeable2 (Expression) deriving instance Data (Expression k a) data ArithOp = OpAdd | OpSub | OpMul | OpDiv deriving (Data, Typeable)
Fails with:
Couldn't match type `Complete' with `HasHoles' Expected type: a -> String -> Expression k a Actual type: a -> String -> Expression HasHoles a In the first argument of `z', namely `EQuote' In the first argument of `k', namely `z EQuote' When typechecking the code for `Data.Data.gunfold' in a standalone derived instance for `Data (Expression k a)': To see the code I am typechecking, use -ddump-deriv
Not sure what you mean here — attoparsec does support unlimited lookahead, in the sense that a parser may fail arbitrarily late in the input stream, and backtrack to any previous state. Although attoparsec is a poor choice for programming language parsing, primarily because of the error messages. I guess I have an outdated notion of attoparsec. But yes, error messages seem to be the weak point of attoparsec. Also, the fact that it only accepts bytestrings makes it harder (but no impossible, since we can convert Strings to ByteStrings) to reuse the parser as a QuasiQuoter. So, I'll rephrase my question. What's the best choice for a library for parsing programming languages nowadays?
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (3)
-
Andrey Chudnov
-
John Lato
-
Roman Cheplyaka