
You may be interested in the profunctor technique of bidirectional parsing/serialisation, a basic version of which is implemented in the ‘codec’ package: https://hackage.haskell.org/package/codec. It was originally intended for binary and textual data formats, but could be used for a conventional textual language as well. I have also used this technique at work to develop a similar library, and found it quite effective & performant once you get the (somewhat delicate) internals correct. As you suggest, it’s generally not feasible to have “encode linearity” (‘decode . encode = id’; printing omits no information from output) and “decode linearity” (‘encode . decode = id’; parsing discards no info from input). However, it’s worth verifying both “encode preservation” (‘encode . decode . encode = encode’; parsing preserves all info that printing does) and “decode preservation” (‘decode . encode . decode = decode’; printing preserves all info that parsing does). On Sat, Jan 2, 2021 at 4:02 PM Askar Safin via Haskell-Cafe < haskell-cafe@haskell.org> wrote:
Hi. How to do deterministic reversible parsing in Haskell? Please, point me to some libs or tools. Or give some advice on how to write such lib or tool. (Also, I am interested in solutions for C or C++.)
Now let me explain my task in detail.
I am interested in formal math. I am going to develop various proof checkers. In particular, I want to develop my PTS checker ( https://en.wikipedia.org/wiki/Pure_type_system ). Here is example of existing PTS checker: http://web.archive.org/web/20180924133846/http://www.cs.kun.nl/~janz/yarrow/ . Here is example of input this checker accepts (taken from manual):
Var (+) : Nat -> Nat -> Nat Def double := \x:Nat. x+x Def S := (+) one
Also, I want to develop something similar to Isabelle ( https://isabelle.in.tum.de/ ). Example of Isabelle input: https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html .
Also, I am going to write converters between various languages. For example, between input for some existing PTS checker and my PTS checker.
In short, I want to write lot of parsers and pretty-printers for various languages. And thus I need some parsing solution.
The solution should have the following features:
1. The parsing should be divided into lexing and parsing. I. e. usual lex + yacc model. Parsing should be done using usual context-free grammars (CFGs). Left-recursive grammars should be supported (this leaves out parser combinator libraries). All languages I am going to deal with fit this model. I. e. I will deal with simple to parse languages, such as Pascal. Not complex ones, like C or C++.
2. Parsing should be deterministic. I. e. parser should check that CFG is unambiguous. Yes, I know that this task is impossible to do on Turing machine. But at very least some partial solution will go. For example, some algorithm, which will check that some reasonable set of CFGs is deterministic. Or even brute force up to some limit will go. (Yes, this will not give absolutely correct results, but this will go.) This is example of grammar for PTS in one of my checkers (start symbol is t0):
t1000 = id t1000 = "(" t0")" t999 = t999 t1000 t3 = t4 "::" t3 t3 = "%" id "::" t0 "." t3 t0 = "!!" id "::" t0 "." t0 t1 = t2 "==>" t1 t0 = t1 t1 = t2 t2 = t3 t3 = t4 t4 = t999 t999 = t1000
If I remember correctly, this grammar is deterministic (checked using brute force up to a limit). I want solution which will verify that this grammar is deterministic.
All stages of parsing should be proved as deterministic (i. e. scanner too).
3. Parsing should be reversible. I. e. pretty-printing should be supported, too. And the tool should guarantee that parsing and pretty-printing match. Now let me give more precise definition. Let's assume we have these functions:
parse :: String -> Maybe X print :: X -> Maybe String
As you can see, not every internal representation is printable. This is OK. For example, if internal representation contains variable names that happen to equal reserved words, then, of course, this is unprintable representation.
The tool should guarantee that: a. Result of parsing should be printable, but printing not necessary should give same string. I. e. if "parse s == Just x", then "print x /= Nothing". (But not necessary "print x == Just s".) b. Parsing of result of printing should give same representation. I. e. if "print x == Just s", then "parse s == Just x"
4. Not only parsing should generate parse tree (i. e. exact representation of derivation), but also AST. AST differs from parse tree. For example, AST should not contain braces. I. e. "2 + (3 * 4)" and "2 + 3 * 4" should give same AST. And the tool should be able parse string to AST and pretty print AST back to string.
5. It should be possible to use CFG not known in compile time. Consider Isabelle language. Here is again example of Isabelle text: https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html . As you can see the language is divided into inner language and outer language. Strings in inner language are quoted. Outer language can add productions to inner language. Inner language cannot change itself. "notation" and "translations" commands add productions to inner language. So inner language is not known beforehand, it is created dynamically. There is no need to dynamically create scanner, i. e. it should be possible to dynamically create parser (for inner language), but there is no such requirement for scanner. Also, parser will not change in the middle of parsing given string.
Again: let's assume I write parser for my Isabelle analog. Scanner and parser for outer language is fixed. Scanner for inner language is fixed, too. But parser for inner language is not. I read input document and add productions to CFG of inner language on the fly. Every time I add production, resulting CFG should be checked for ambiguity. And when I find inner language string in input, I process it using just constructed parser for inner language.
--
Points 2 and 3 are the most important in this list. If there is solution which contains them alone, this already will be very good.
Now let me list considered solutions.
* Usual combinator libraries (such as parsec) will not go. They don't work with left-recursive CFGs. They don't check grammars for ambiguities. They don't do reversible parsing. But they allow creating language on the fly, and this is good thing.
* Happy and alex will not go. As well as I know Happy can prove unambiguity of some very simple grammars (i. e. for some set of simple grammars it will not report any conflicts and thus will prove they unambiguity). It seems this is LR(1) set. But, as well as I know, Happy will report conflicts on PTS grammar given above, i. e. Happy cannot prove its unambiguity. Also, Happy and Alex cannot pretty-print. And Happy and Alex don't allow supplying CFG on the fly.
* Package "earley". Handle arbitrary CFGs, including left-recursive. This is good thing. It seems CFG can be supplied in runtime. This is good thing, too. Gives all parse trees, thus we can verify that count of parse trees equals to 1. This is good thing, too. Unfortunately, earley cannot verify CFS's unambiguity before supplying input. Also, earley doesn't support pretty-printing.
* PEG libraries. Will not go, I want CFGs.
* I searched on Hackage reversible parsing libraries. Unfortunately, none of them guarantee reversibility. I. e. it is very easy to fool these libraries and create parsers and printers not satisfying equations I gave above. Also, this libraries doesn't have other features I mentioned
* http://augeas.net/ is very cool project. Unlike nearly everything I found in internet, Augeas provides guaranteed reversible parsing and pretty-printing (!!!!!). Unfortunately, set of supported target languages is very limited. Regular languages are supported. And some very limited set of CFGs is supported. For example, it is possible to construct parser for JSON. But parser for Pascal seems impossible. Augeas not designed for parsing programming languages. Only simple configuration languages like JSON, XML, etc. Mutually recursive nonterminals are not supported. Also, Augeas provides C API, not Haskell one.
* https://www.brics.dk/xsugar.html is very good. It provides guaranteed reversible translation between arbitrary language and XML. And XSugar even verifies unambiguity of the CFG using some sophisticated algorithm. Unfortunately, this algorithm still cannot prove unambiguity of PTS grammar given above. Also, XSugar is not Haskell library. It is standalone Java app. So, I need to convert some language to XML, then load this XML to my Haskell program and parse. Also, there is similar lib https://www.brics.dk/Xact.html from same author, but it is for Java, not for Haskell.
I was able to find only 2 projects, which do guaranteed reversible parsing: mentioned Augeas and XSugar/Xact. Unfortunately, both are not for Haskell. And both cannot prove unambiguity for my PTS grammar.
So, please tell me about solution. Or give advice how to write one.
Also, let me describe my experience with various algorithms for checking ambiguity of grammar.
1. Try to construct LR(1) table. If there is no conflicts, then the grammar is unambiguous. It seems this is algorithm used by bison and happy. Unfortunately it is too weak and cannot prove unambiguity of my PTS grammar. 2. Try to construct LR(k) table. This algorithm is stronger than LR(1). I was unable to find implementation of this algorithm in any language. I don't want to write such algorithm. 3. Schmitz's algorithm. http://www.lsv.fr/~schmitz/pub/expamb.pdf . Even stronger than LR(k). I downloaded his bison fork. Unfortunately this bison fork can prove unambiguity for same set of grammars original bison can (i. e. LR(1)). I. e. it seems Schmitz did not implemented his algorithm. I don't want to implement his algorithm either. 4. Horizontal and vertical ambiguity checking. https://www.brics.dk/grammar/kruse.pdf . This is algorithm used in XSugar. Unfortunately, it doesn't handle my PTS grammar. 5. Brute force up to some limit. Doesn't give unambiguity guarantee. But I like it, and I will use it if I don't find anything else. It is the only algorithm which can handle my PTS grammar (and give enough amount of guarantee), not counting LR(k) and Schmitz's algorithm (I was not able to find implementations of both).
I will try to write tool I want, I will write about my progress to this list.
== Askar Safin https://github.com/safinaskar _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post.