New subject: [ANN] parser-unbiased-choice-monad-embedding - the best parsing library; it is based on arrows (was: Pearl! I just proved theorem about impossibility of monad transformer for parsing with (1) unbiased choice and (2) ambiguity checking before running embedded monadic action (also, I THREAT I will create another parsing lib))

8 Jul 2021

      Hi.

I announce my parsing library https://hackage.haskell.org/package/parser-unbiased-choice-monad-embedding .
I think it is best parsing library, and you should always use it instead of other solutions. I will tell
you why. You may check comparison table: https://paste.debian.net/1203863/ (if you don't
understand the table, don't worry; come back to the table after reading this mail).

My library is solution to problem described in this e-mail thread, so read it for motivation:
https://mail.haskell.org/pipermail/haskell-cafe/2021-June/134094.html .

Now let me describe my solution in detail.

I will distinguish parser errors (i. e. "no parse" or "ambiguous parse") and semantic errors
(i. e. "division by zero", "undefined identifier", "type mismatch", etc).

So, now I will show you parser with unbiased choice, which allows monad embedding.

As a very good introduction to arrows I recommend this:
https://ocharles.org.uk/guest-posts/2014-12-21-arrows.html .

We start from classic parser with this type:

newtype ParserClassic t a = ParserClassic ([t] -> [(a, [t])])

You can make it be instance of Functor, Applicative, Monad and Alternative.

This type is similar to ReadS from base
( https://hackage.haskell.org/package/base-4.15.0.0/docs/Text-ParserCombinator... )
and to "type Parser = StateT String []" from example here:
https://hackage.haskell.org/package/transformers-0.5.5.0/docs/Control-Monad-... .
I will not give more information, feel free to find it in internet.

Now let's replace "a" with Kleisli arrow "b -> m c". We will get this:

newtype Parser1 t m b c = Parser1 ([t] -> [(b -> m c, [t])])

Here is resulting parsing library and example: https://godbolt.org/z/qsrdKefjT (backup:
https://paste.debian.net/1203861/ ). We can use this parser in Applicative style. And when we
need to lift something to embedded monad, we resort to Arrow style. I didn't test this code much.

Parser1 cannot be Monad (I proved this in previous letters).

At this point someone may ask: why we need arrows? Can we achieve same effect using
Applicative only? Yes, we can. Here is result: https://godbolt.org/z/ocY3csWjs (backup:
https://paste.debian.net/1203862/ ), I will call such method "anti-arrow". But we have two
problems here: first, I think this method makes parser code uglier. Second, this method is
limited, and I will show you why later.

Okey, now back to arrows.

Still, we have this problems (I'm about our arrow code):
- Handling left-recursive grammars is tricky (it is possible, but not as simple as right-recursive
ones)
- Parsing errors (as opposed to semantic error messages) are not great. I. e. if there is no valid
parses, we don't get any additional information, i. e. we don't know which token caused error
- We don't track locations, so semantic errors are not great, too (i. e. we want to have location
info to embed it into semantic error messages)
- I suspect this parsing library will have very bad speed, possibly O(exp(input size))

So, let's combine our ideas with package Earley ( https://hackage.haskell.org/package/Earley ).
Earley has type "Prod r e t a". Let's replace "a" with Kleisli arrow "b -> m c".
Also let's wrap this Kleisli arrow to "L" from srcloc ( https://hackage.haskell.org/package/srcloc )
to get location info. Also, we wrap "t" to "L", too.

Thus we get this type:

newtype ArrowProd r e t m b c = ArrowProd (Prod r e (L t) (L (b -> m c)))

Here is resulting library with example: https://hackage.haskell.org/package/parser-unbiased-choice-monad-embedding .
(I didn't test much.) (I recommend to read docs to Earley first.) (My library is designed to be used
with some external lexer, lexer-applicative is recommended.)

The example also uses lexer-applicative, because lexer-applicative automatically wraps tokens into "L".

So, what we got? We solved original goals, i. e.:
- We have combinator parser with unbiased choice. Unfortunately, it is not monadic, but it is
Applicative and Arrow
- We can embed monad, for example, to handle semantic errors
- We can test parsing errors before executing embedded monadic action

Additionally we solved 4 remaining problems mentioned above, i. e.:
- Handling left-recursive grammars is as simple as right-recursive ones (thanks to Earley's
RecursiveDo)
- Parsing errors are ok
- We track locations and we can embed them into semantic errors
- We have relatively good speed thanks to Earley

What else? We can test grammar for ambiguity using
https://hackage.haskell.org/package/Earley-0.13.0.1/docs/Text-Earley.html#v:... (I didn't wrap
this function, but this can easily be done).

Personally I think that my parsing library is simply best. :) And that one should always use it
instead of all others. Why I think this? Because:
- We all know that CFG is better than PEG, i. e. unbiased choice is better that biased
- I don't want to merely produce AST, I want to process information while I parse, and this
processing will uncover some semantic errors
- So I want unbiased choice with handling semantic errors
- The only existing solution, which can do this is my library (and also "happy")
- But "happy" doesn't automatically track locations (as well as I know it tracks lines, but not columns)
- So the best parsing solution is my library :)

My library has another advantage over happy: by extensive use of Alternative's "many" (and
arrow banana brackets) you can write things, which are impossible to write in happy. Consider
this artificial example (subset of Pascal):
---
var
  a: integer; b: integer;
begin
  a := b + c;
end.
---
This is its parser, which uses my library (completely untested):
---
arrowRule (proc () -> do {
  sym TVar -< ();
  declated <- many (ident <* sym TColon <* sym TInteger <* sym TSemicolon) -< ();
  sym TBegin -< ();

  -- Banana brackets
  (|many (do {
    x <- ident -< ();
    lift -< when (x `notElem` declated) $ Left "Undeclated identifier";
    sym TAssign -< ();

    -- My library doesn't have "sepBy", but it can be easily created
    (|sepBy (do {
      y <- ident -< ();
      lift -< when (y `notElem` declated) $ Left "Undeclated identifier";
      returnA -< ();
    })|) [TPlus];
    sym TSemicolon -< ();
    returnA -< ();
  |);
  sym TEnd -< ();
  sym TDot -< ();
  returnA -< ();
})
---
This is impossible to write similar code using "happy" with similar ergonomics, because in
happy we would not have access to "declared" inside production for sum. The only way we
can access "declared" is using state monad (for example, StateT) and put this "declared"
into this state monad. But in my library you don't have to use any StateT!

Now let me say why mentioned "anti-arrow" style is not enough. Let's try to rewrite this
Pascal example using anti-arrow style:
---
antiArrowLift $ do { -- ApplicativeDo
  sym TVar;
  declated <- many (ident <* sym TColon <* sym TInteger <* sym TSemicolon);
  sym TBegin;
  many $ antiArrowLift $ do {
    x <- ident;
    sym TAssign;

    -- My library doesn't have "sepBy", but it can be easily created
    sepBy [TPlus] $ antiArrowLift $ do {
      y <- ident;
      pure $ when (y `notElem` declated) $ Left "Undeclated identifier"; -- Oops
    };
    sym TSemicolon;
    pure $ when (x `notElem` declated) $ Left "Undeclated identifier"; -- Oops
  };
  sym TEnd;
  sym TDot;
  pure ();
}
---
Looks good. But there is a huge problem here: look at lines marked as "Oops". They refer to
"declared", but they cannot refer to it, because outer "do" is ApplicativeDo. So, yes, merely
Applicative is not enough.

Does my library have disadvantages? Of course, it has!
- It is not monadic
- It cannot statically check that grammar is element of LR(1) set (as well as I understand,
happy can do this)
- My library has relatively good speed asymptotic (same as Earley), but it is still not fastest
- My library will freeze on infinitely ambiguous grammars. Attempting to check such grammar
for ambiguity using Earley's "upTo" will cause freezing, too. See also: https://github.com/ollef/Earley/issues/54
- My library is based on unbiased choice and CFG (as opposed to biased choice and PEG).
I consider this as advantage, but my library will not go if you want to parse language defined by some PEG

My library is unfinished. The following things are needed:
- We need combinator similar to Alternative's "many", but every item should have access to
already parsed part of list. Such combinator should be made to be used by banana brackets
- We need combinators similar to parsec's chainl and chainr (my library already supports left
and right recursion thanks to Earley, but still such combinators would be useful)
- Already mentioned "sepBy"
- I didn't wrap all Earley functionality, for example,  is left unwrapped

I don't have motivation for fix this things, because I decided to switch to Rust as my main language.

Final notes
- It is quite possible that I actually need attribute grammars or syntax-directed translation. I didn't explore this
- I suspect that my parser is arrow transformer (whatever this means)

Side note: I really want some pastebin for reproducible shell scripts (possibly dockerfiles), do you know such?

Answer me if you have any questions.

==
Askar Safin
http://safinaskar.com
https://sr.ht/~safinaskar
https://github.com/safinaskar

Askar Safin

Jaro Reinders

Askar Safin

Jaro Reinders

tags

participants (2)