[Haskell-cafe] Design of extremely usable programming language libraries

28 May 2013

      Dear Cafe,
I'm exploring the design space of programming language
libraries with enhanced usability and I'd your help and comments.
I'll start with a few short questions, but offer a detailed discussion
of the motivations
and the problems I'm facing below. So, if you have interest in
the subject or feel you can offer some insight, please, do read on.

* Does any generic traversal/transformation (uniplate-style) library
  support GADTs?
* What is the best choice, performance- and memory-wise, for a parser
  combinator  library with support for arbitrary look-ahead? Parsec is
  considered slow by some [1], but is it only in comparison with
  attoparsec (which, unfortunately, doesn't support arbitrary
  look-ahead)? Is there any parser library that performs better than
  Parsec while still supporting arbitrary look-ahead.
* Any multi-mode pretty printer libraries? By multi-mode I mean
  writing code once and being able to generate, say, both "pretty" and
  "minified" text representations of a program by changing just one
  parameter. Also, what's the most efficient pretty-printing library
  nowadays? Blaze?

I've been using Haskell for quite a while now, primarily, for
programming-language applications: program analysis, transformation
and compilation. I'm sure many would agree that PL work is where
Haskell shines. In the recent years new language features and
libraries --namely, GADTs, Template Haskell, quasi-quotation and
generic programming--- have appeared that could make working with
languages even easier than before. That's why it's sad to see that
none of the PL libraries seem to make good use of these features
(however, I might be starting to understand why). So, I'm currently
exploring
the design space for a library that uses these advanced Haskell features for
delivering better usability, and I'm having problems with implementing some
of them. I welcome comments on both the motivations, overall design and
the more technical aspects. I've omitted a few details because it's a long
e-mail as is. If something is not clear or doesn't make sense, please,
let me know.

I'll start by listing the features that an "ideal" PL library should
have, and that I've come to cherish as both a heavy user and a
developer of such. The basic features (pretty much every library has
them) include a parser (text->AST (abstract syntax tree)) and a
pretty-printer (AST->text), as well as a Haskell representation of the
AST that is somewhat easy to use. Pretty much every library has that
--- although some might debate the ease of use of the AST
representations.

However, there are other features that, in my opinion, are essential
to a PL library. The features are motivated by three requirements:
static safety (as few run-time errors as possible), minimal code
duplication (DRY) and ease of use and inspection of the code.

1) the pretty-printer should be multi-mode. One should be able to write
code once and be able to generate different textual representations of
the AST:
         - the "pretty" which is nice to the eye with white spaces,
           indentations etc.
         - minified, with minimum white space (while still being valid)
         - debuggable which inserts comments based on AST annotations
         - source-map generation
         - being able to generate colored LaTeX/HTML code would be
           nice, but non-essential

2) ASTs should be statically safe: you should only be able construct
values that represent valid programs, or get a typechecker/compiler
error otherwise. Languages that have syntactic productions that can
appear in one context but not in another need GADTs with type witnesses to
achieve that. In fact, such languages are often used to motivate GADTs
in the first place [2]. And while the problem in [2] could have been
solved by splitting the Expr datatype into two (IntExpr and BoolExpr),
in some languages this can't be done (or produces awkward syntax
trees).

3) a quasi-quoter with support for anti-quotation and quoted
patterns. This also saves a lot of typing *and* makes your code less
error-prone and easier to read. What is better (to both write and read)?
...
[js|#x# = (function (a, b) {return {t1: a + b, t2: a*b};})(#x#, #y#);|]
or
ExprStmt def $ AssignExpr def x (CallExpr def (FuncExpr def Nothing
[Id def "a", Id def "b"] $ ReturnStmt def $ ObjectLit [(PropId def $
Id def "t1", InfixExpr def OpAdd (VarRef def $ Id def "a") (VarRef def $
Id def "b")), PropId def $
Id def "t2", InfixExpr def OpMul (VarRef def $ Id def "a") (VarRef def $
Id def "b")])) [x, y]
The caveat here is that, to help ensure correctness,
the quasi-quoter and the parser should share code as much code as
possible. Ideally, there should be just one parser that has a switch
for recognizing normal and quasi-quoted programs. However, that would
require adding additional constructors representing anti-quotations to
our AST. And with that the user might be able to generate invalid
AST's and cause a run-time error. The solution to the last problem is
to statically constrain all the values that are passed to, say, the
pretty-printer so that they are guaranteed to be free of
anti-quotes (see an example definition below). However, that, again,
requires GADTs (e.g. have all the AST datatypes have an extra type
parameter).
...
data EType  = Complete | HasHoles
type family Quoted a b :: *
canHaveHolesT :: a -> b -> Quoted a b
canHaveHolesT _ _ = undefined
type instance Quoted HasHoles Complete = HasHoles
type instance Quoted Complete HasHoles = HasHoles
type instance Quoted HasHoles HasHoles = HasHoles
type instance Quoted Complete Complete = HasHoles
data Expr t where
  EInt :: Int -> Expr Complete
  EAdd :: Expr t1 -> Expr c2 -> Expr (Holes t1 t2)
  ...
  EQuote :: String -> Expr HasHoles
And then we could have a normal parser return a value 'Expr Complete'
and a quasi-quotation parser retunr a value 'Expr
HasHoles'. Similarly, the pretty printer function could have type
'Expr Complete -> Doc'.

4) We should be able to annotate ASTs with arbitrary values, and
change the types of those values as we go. The most user friendly way,
IMO, is to have the AST datatypes be polymorphic and have that type
parameter as an extra field in every constructor. E.g.,
...
data Expr t a where
  EInt :: a -> Int -> Expr Complete a
  EAdd :: a -> Expr t1 a -> Expr c2 a -> Expr (Holes t1 t2) a
  ...
  EQuote :: a -> String -> Expr HasHoles a
Then we can use the functions in Traverseable to change types of
annotations,
and inspect the values by pattern-matching on constructors.

5) support for generic operations on syntax trees. Uniplate, which has
been designed to work with ASTs, and is awesome for that purpose
because it saves a lot of time. I use
transform(Bi) and universe(Bi) all the time and it saves *a lot* of
typing. Pretty much all my analysis/transformation code uses those four
small-but-powerful function calls -- and, dare I say, it's quite
elegant.

Other useful, but not crucial features include:

1) diffs for ASTs (in the spirit of the 'gdiff' library, which, alas,
doesn't work with polymorphic datatypes)

2) QuickCheck arbitrary instances for ASTs. No technical difficulty
there, but writing instances that generate interesting programs and
don't run out of memory is quite hard :) I wish 'Agata' was still
supported, or there was some library that helps writing Arbitrary instances
for ASTs.

If you think there's another feature in mind that is missing from the
list, please, let me know.

The (perceived) challenges in implementing the functionality outlined
above are as follows:

1) No multi-mode pretty-printing library. I think that the mutli-mode
   functionality could be implemented on top of an existing library by
   definining new combinators, but it would be nice to have a library
   that supports them out of the box. The particular features that I'm
   missing are:
   - "non-essential space/(soft-)line break" combinators that are
     interpreted as spaces/line-breaks in the "pretty" mode and as empty
     docs in the "minified" mode.
   - "comment" combinator which inserts the text in a comment only if the
     "debug" mode is on
   - being able to record the positions of AST nodes in the resulting
     text (for generating source maps). Not sure what would be a
     convenient interface for that. Note: I know that mainland-pretty has
     position information, but I don't think it's helpful for generating
     source maps.

2) The biggest problem is that there are two good reasons to use GADTs
   when specifying AST datatypes.  However, uniplate doesn't work with
   GADTs and, as far as I know, no currently supported generic
   programming library does (to be precise, I need support for
   families of mutually recursive polymorphic GADTs). Am I missing
   some library, or is my understanding correct? If it's the latter,
   is there any fundamental limitation that prevents creating such a
   library?  Maybe there are other (but still elegant) ways to satisfy
   my requirements without using GADTs?

3) 'gdiff' doesn't support polymorphic datatypes. Is there any other
    library that does?

[1] http://www.serpentine.com/blog/2010/03/03/whats-in-a-parsing-library-1/
[2] http://en.wikibooks.org/wiki/Haskell/GADT

PS: My attempts so far are in
https://github.com/achudnov/language-nextgen/blob/master/Language/Nextgen/Sy...

Regards,
Andrey Chudnov