Re: [Haskell-cafe] How to do reversible parsing?

4 Jan 2021

      You may be interested in the profunctor technique of bidirectional
parsing/serialisation, a basic version of which is implemented in the
‘codec’ package: https://hackage.haskell.org/package/codec. It was
originally intended for binary and textual data formats, but could be used
for a conventional textual language as well. I have also used this
technique at work to develop a similar library, and found it quite
effective & performant once you get the (somewhat delicate) internals
correct.

As you suggest, it’s generally not feasible to have “encode linearity”
(‘decode . encode = id’; printing omits no information from output) and
“decode linearity” (‘encode . decode = id’; parsing discards no info from
input). However, it’s worth verifying both “encode preservation” (‘encode .
decode . encode = encode’; parsing preserves all info that printing does)
and “decode preservation” (‘decode . encode . decode = decode’; printing
preserves all info that parsing does).

On Sat, Jan 2, 2021 at 4:02 PM Askar Safin via Haskell-Cafe <
haskell-cafe@haskell.org> wrote:
...
Hi. How to do deterministic reversible parsing in Haskell? Please, point
me to some libs or tools.
Or give some advice on how to write such lib or tool. (Also, I am
interested in solutions for C or C++.)
Now let me explain my task in detail.
I am interested in formal math. I am going to develop various proof
checkers. In particular, I want
to develop my PTS checker ( https://en.wikipedia.org/wiki/Pure_type_system
). Here is example of
existing PTS checker:
http://web.archive.org/web/20180924133846/http://www.cs.kun.nl/~janz/yarrow/
.
Here is example of input this checker accepts (taken from manual):
Var (+) : Nat -> Nat -> Nat
Def double := \x:Nat. x+x
Def S := (+) one
Also, I want to develop something similar to Isabelle (
https://isabelle.in.tum.de/ ). Example of
Isabelle input: https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html .
Also, I am going to write converters between various languages. For
example, between input for some
existing PTS checker and my PTS checker.
In short, I want to write lot of parsers and pretty-printers for various
languages. And thus I need
some parsing solution.
The solution should have the following features:
1. The parsing should be divided into lexing and parsing. I. e. usual lex
+ yacc model. Parsing should
be done using usual context-free grammars (CFGs). Left-recursive grammars
should be supported (this
leaves out parser combinator libraries). All languages I am going to deal
with fit this model. I. e.
I will deal with simple to parse languages, such as Pascal. Not complex
ones, like C or C++.
2. Parsing should be deterministic. I. e. parser should check that CFG is
unambiguous. Yes, I know
that this task is impossible to do on Turing machine. But at very least
some partial solution will
go. For example, some algorithm, which will check that some reasonable set
of CFGs is deterministic.
Or even brute force up to some limit will go. (Yes, this will not give
absolutely correct results,
but this will go.) This is example of grammar for PTS in one of my
checkers (start symbol is t0):
t1000 = id
t1000 = "(" t0")"
t999 = t999 t1000
t3 = t4 "::" t3
t3 = "%" id "::" t0 "." t3
t0 = "!!" id "::" t0 "." t0
t1 = t2 "==>" t1
t0 = t1
t1 = t2
t2 = t3
t3 = t4
t4 = t999
t999 = t1000
If I remember correctly, this grammar is deterministic (checked using
brute force up to a limit).
I want solution which will verify that this grammar is deterministic.
All stages of parsing should be proved as deterministic (i. e. scanner
too).
3. Parsing should be reversible. I. e. pretty-printing should be
supported, too. And the tool should
guarantee that parsing and pretty-printing match. Now let me give more
precise definition. Let's
assume we have these functions:
parse :: String -> Maybe X
print :: X -> Maybe String
As you can see, not every internal representation is printable. This is
OK. For example, if internal
representation contains variable names that happen to equal reserved
words, then, of course, this is
unprintable representation.
The tool should guarantee that:
a. Result of parsing should be printable, but printing not necessary
should give same string. I. e.
if "parse s == Just x", then "print x /= Nothing". (But not necessary
"print x == Just s".)
b. Parsing of result of printing should give same representation. I. e. if
"print x == Just s", then
"parse s == Just x"
4. Not only parsing should generate parse tree (i. e. exact representation
of derivation), but also
AST. AST differs from parse tree. For example, AST should not contain
braces. I. e. "2 + (3 * 4)" and
"2 + 3 * 4" should give same AST. And the tool should be able parse string
to AST and pretty print
AST back to string.
5. It should be possible to use CFG not known in compile time. Consider
Isabelle language. Here is
again example of Isabelle text:
https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html . As you can
see the language is divided into inner language and outer language.
Strings in inner language are
quoted. Outer language can add productions to inner language. Inner
language cannot change itself.
"notation" and "translations" commands add productions to inner language.
So inner language is not
known beforehand, it is created dynamically. There is no need to
dynamically create scanner, i. e.
it should be possible to dynamically create parser (for inner language),
but there is no such
requirement for scanner. Also, parser will not change in the middle of
parsing given string.
Again: let's assume I write parser for my Isabelle analog. Scanner and
parser for outer language is
fixed. Scanner for inner language is fixed, too. But parser for inner
language is not. I read input
document and add productions to CFG of inner language on the fly. Every
time I add production,
resulting CFG should be checked for ambiguity. And when I find inner
language string in input, I
process it using just constructed parser for inner language.
--
Points 2 and 3 are the most important in this list. If there is solution
which contains them alone,
this already will be very good.
Now let me list considered solutions.
* Usual combinator libraries (such as parsec) will not go. They don't work
with left-recursive CFGs.
They don't check grammars for ambiguities. They don't do reversible
parsing. But they allow creating
language on the fly, and this is good thing.
* Happy and alex will not go. As well as I know Happy can prove
unambiguity of some very simple grammars
(i. e. for some set of simple grammars it will not report any conflicts
and thus will prove they unambiguity).
It seems this is LR(1) set. But, as well as I know, Happy will report
conflicts on PTS grammar given
above, i. e. Happy cannot prove its unambiguity. Also, Happy and Alex
cannot pretty-print. And Happy
and Alex don't allow supplying CFG on the fly.
* Package "earley". Handle arbitrary CFGs, including left-recursive. This
is good thing. It seems CFG
can be supplied in runtime. This is good thing, too. Gives all parse
trees, thus we can verify that
count of parse trees equals to 1. This is good thing, too. Unfortunately,
earley cannot verify CFS's
unambiguity before supplying input. Also, earley doesn't support
pretty-printing.
* PEG libraries. Will not go, I want CFGs.
* I searched on Hackage reversible parsing libraries. Unfortunately, none
of them guarantee
reversibility. I. e. it is very easy to fool these libraries and create
parsers and printers not
satisfying equations I gave above. Also, this libraries doesn't have other
features I mentioned
* http://augeas.net/ is very cool project. Unlike nearly everything I
found in internet, Augeas
provides guaranteed reversible parsing and pretty-printing (!!!!!).
Unfortunately, set of supported
target languages is very limited. Regular languages are supported. And
some very limited set of CFGs
is supported. For example, it is possible to construct parser for JSON.
But parser for Pascal seems
impossible. Augeas not designed for parsing programming languages. Only
simple configuration
languages like JSON, XML, etc. Mutually recursive nonterminals are not
supported. Also, Augeas
provides C API, not Haskell one.
* https://www.brics.dk/xsugar.html is very good. It provides guaranteed
reversible translation
between arbitrary language and XML. And XSugar even verifies unambiguity
of the CFG using some
sophisticated algorithm. Unfortunately, this algorithm still cannot prove
unambiguity of PTS grammar
given above. Also, XSugar is not Haskell library. It is standalone Java
app. So, I need to convert
some language to XML, then load this XML to my Haskell program and parse.
Also, there is similar lib
https://www.brics.dk/Xact.html from same author, but it is for Java, not
for Haskell.
I was able to find only 2 projects, which do guaranteed reversible
parsing: mentioned Augeas and
XSugar/Xact. Unfortunately, both are not for Haskell. And both cannot
prove unambiguity for my PTS
grammar.
So, please tell me about solution. Or give advice how to write one.
Also, let me describe my experience with various algorithms for checking
ambiguity of grammar.
1. Try to construct LR(1) table. If there is no conflicts, then the
grammar is unambiguous. It seems
this is algorithm used by bison and happy. Unfortunately it is too weak
and cannot prove unambiguity
of my PTS grammar.
2. Try to construct LR(k) table. This algorithm is stronger than LR(1). I
was unable to find
implementation of this algorithm in any language. I don't want to write
such algorithm.
3. Schmitz's algorithm. http://www.lsv.fr/~schmitz/pub/expamb.pdf . Even
stronger than LR(k). I
downloaded his bison fork. Unfortunately this bison fork can prove
unambiguity for same set of
grammars original bison can (i. e. LR(1)). I. e. it seems Schmitz did not
implemented his algorithm.
I don't want to implement his algorithm either.
4. Horizontal and vertical ambiguity checking.
https://www.brics.dk/grammar/kruse.pdf . This is
algorithm used in XSugar. Unfortunately, it doesn't handle my PTS grammar.
5. Brute force up to some limit. Doesn't give unambiguity guarantee. But I
like it, and I will use
it if I don't find anything else. It is the only algorithm which can
handle my PTS grammar (and
give enough amount of guarantee), not counting LR(k) and Schmitz's
algorithm (I was not able to find
implementations of both).
I will try to write tool I want, I will write about my progress to this
list.
==
Askar Safin
https://github.com/safinaskar
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

Re: [Haskell-cafe] How to do reversible parsing?

Jon Purdy