Techniques for ensuring parser correctness?

Hello,

I find that parser correctness is often hard to verify. Therefore, I'm interested in techniques that others have used successfully, especially with Haskell.

Techniques I'm aware of:

* Round trip checks: Generate a data structure, render it as a string, parse it back, and compare. QuickCheck can be used to automate this.
* Fuzz testing: What tools exist to help me?
* Formal verification: Has anyone been using this with Haskell parsers? Other than general theorem provers, say Isabelle, what tools exist?

My specific need: The immediate challenge I have is that I'm modifying the parser that Darcs uses, and we would like to improve the parser's test suite as well. The parser abstraction used in this case follows Parsec's API. Left to my own devices I would use round trip checks, written with QuickCheck, for this exercise. Because we're using a Parsec-style parser, I don't have a nice neat grammar handy.

Thanks in advance for any advice you have!

Thanks,
Jason
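[Editor's note: the round-trip check described above can be sketched in miniature. The Expr type and its parser below are toy stand-ins, not Darcs types, and the fixed list of test values stands in for what QuickCheck would generate via an Arbitrary instance and `quickCheck prop_roundTrip`.]

```haskell
-- Round-trip property: render a value, parse it back, compare.
module Main where

import Data.Char (isDigit)

-- Toy AST standing in for the real patch types (hypothetical).
data Expr = Atom Int | Pair Expr Expr deriving (Eq, Show)

render :: Expr -> String
render (Atom n)   = show n
render (Pair a b) = "(" ++ render a ++ " " ++ render b ++ ")"

-- A tiny hand-rolled parser returning the value and leftover input;
-- with Parsec the same shape would be written with combinators.
parse :: String -> Maybe (Expr, String)
parse ('(':s) = do
  (a, s1)     <- parse s
  (' ':s2)    <- Just s1          -- pattern-match failure => Nothing
  (b, ')':s3) <- parse s2
  Just (Pair a b, s3)
parse s = case span isDigit s of
  ("", _)  -> Nothing
  (ds, s') -> Just (Atom (read ds), s')

-- The property: parsing a rendering yields the original, with no
-- leftover input.
prop_roundTrip :: Expr -> Bool
prop_roundTrip e = parse (render e) == Just (e, "")

main :: IO ()
main = do
  let es = [ Atom 1
           , Pair (Atom 1) (Atom 2)
           , Pair (Pair (Atom 0) (Atom 3)) (Atom 42) ]
  print (all prop_roundTrip es)
```

With a real QuickCheck setup you would replace the fixed list with an Arbitrary instance for Expr and run `quickCheck prop_roundTrip`.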

On 26 jul 2010, at 03:51, Jason Dagit wrote:
Hello,
I find that parser correctness is often hard to verify. Therefore, I'm interested in techniques that others have used successfully, especially with Haskell.
It seems to me that you are not so much trying to verify parsers in general, but more specifically Parsec parsers. Since in Parsec-based parsers you control the backtracking explicitly, such parsers can have very complicated semantics. Now the question arises: what does it mean for a (Parsec) parser to be correct? Do you have another description of the language which is to be recognised, e.g. a context-free grammar? Only then can you give meaning to the word "correctness".

In general I think that the more your parser combinators deviate from context-free grammars in terms of expressiveness, the more problems you will encounter. If you make heavy use of the monadic part, you will not only have to prove the correctness of static parsers, but even of parsers which are generated dynamically. If you use the backtrack-controlling features, your proofs become even more complicated, since it is unlikely that the more abstract formalism in which you have specified your language has a similar construct: here about 50 years of research on parsing techniques and grammar analysis comes in. If your grammar is e.g. LL(1), then you can verify that the backtrack-controlling features in your Parsec parser have been used in a sound way, i.e., that you will be able to parse any sentence that your grammar describes.

If you have a context-free grammar, and you want to be relatively sure that the parser is correct without going through large verification efforts, I suggest you use the uu-parsinglib; the only restriction there is that your grammar should fulfil certain modest "well-formedness" criteria, such as being non-left-recursive and non-ambiguous. Then the semantics of the combinators are exactly what you want, i.e. your parsers and your grammars are isomorphic. If, however, you have an "incorrect formal specification", i.e. a specification which contains ambiguous non-terminals, or p* constructs where p can reduce to the empty string, things break.
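[Editor's note: the point above, that explicit backtracking control makes a Parsec parser's semantics deviate from context-free alternation, can be seen in two lines. The grammar S -> "ab" | "ac" accepts "ac", but a naive transcription with <|> does not, because <|> only tries its right branch when the left one failed without consuming input.]

```haskell
-- How Parsec's <|> differs from the grammar's alternation.
module Main where

import Text.Parsec
import Text.Parsec.String (Parser)

naive, correct :: Parser String
naive   = string "ab" <|> string "ac"        -- "ab" consumes 'a', then fails; no backtrack
correct = try (string "ab") <|> string "ac"  -- try restores input on failure

main :: IO ()
main = do
  print (parse naive   "" "ac")  -- a parse error, though the grammar accepts "ac"
  print (parse correct "" "ac")  -- Right "ac"
```

Proving that each `try` (or its absence) is sound with respect to the intended grammar is exactly the verification burden Doaitse describes.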
The first problem is not recognised and will lead to a non-terminating parser, whereas the second is detected by the grammars analysing themselves while being used, leading to a run-time error message once you reach that part of the grammar during parsing.

If you insist on using left-recursive parsers you may use the left-corner transform from the http://hackage.haskell.org/packages/archive/ChristmasTree/0.2/doc/html/Text-... package, or use a parser generator like Happy; parser generators usually do some form of analysis (i.e. proving properties), which catches many mistakes in the design of a language.

Furthermore you may take a look at:

@inproceedings{DBLP:conf/mpc/BrinkHL10,
  author    = {Kasper Brink and Stefan Holdermans and Andres L{\"o}h},
  title     = {Dependently Typed Grammars},
  booktitle = {MPC},
  year      = {2010},
  pages     = {58-79},
  ee        = {http://dx.doi.org/10.1007/978-3-642-13321-3_6},
  crossref  = {DBLP:conf/mpc/2010},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Doaitse Swierstra
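[Editor's note: besides the left-corner transform mentioned above, the usual combinator-level workaround for a left-recursive rule such as E -> E '-' N | N is chainl1, which expresses the same left-associative grammar without left recursion. A minimal Parsec sketch:]

```haskell
-- Left-associative subtraction without left recursion, via chainl1.
module Main where

import Text.Parsec
import Text.Parsec.String (Parser)

num :: Parser Int
num = read <$> many1 digit

-- E -> E '-' N | N, rewritten: chainl1 parses N (op N)* and folds left.
expr :: Parser Int
expr = num `chainl1` (char '-' >> return (-))

main :: IO ()
main = print (parse expr "" "10-3-2")  -- Right 5, i.e. (10 - 3) - 2
```

Transcribing the left-recursive rule directly as `expr = expr >> char '-' >> num ...` would loop forever, which is the "first problem is not recognised" failure mode above.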
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

On Mon, Jul 26, 2010 at 12:03 AM, S. Doaitse Swierstra < doaitse@swierstra.net> wrote:
Thanks for the reply.
I think the grammar is fairly simple, although I'm not confident classifying it. I know it can be parsed with just a simple pass over the data. The only uses of backtracking are just to figure out what is next, like a peek at the next token. Let me give you some samples of what the input looks like.
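[Editor's note: the "peek at the next token" style of backtracking Jason describes can be expressed with Parsec's lookAhead, which inspects input without consuming it; the Entry type and the branch condition below are hypothetical stand-ins, not the actual Darcs code.]

```haskell
-- Dispatching on the next character without consuming it.
module Main where

import Text.Parsec
import Text.Parsec.String (Parser)

data Entry = Tag String | Patch String deriving Show

entry :: Parser Entry
entry = do
  c <- lookAhead anyChar  -- peek: consumes nothing on success
  if c == '['
    then Tag   <$> (string "[TAG " >> many1 (noneOf "\n"))
    else Patch <$> many1 (noneOf "\n")

main :: IO ()
main = print (parse entry "" "[TAG 2.4")  -- Right (Tag "2.4")
```

Because the dispatch only peeks, no committed input ever needs to be rewound, which matches the "single pass over the data" description.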
Here are three entries from the "inventory"; they correspond to PatchInfos:
[TAG 2.4
Reinier Lamers

Hi Jason

Which particular file in the Darcs tree defines the parser? Small ad hoc formats don't necessarily have a simple underlying grammar, even though a parser for them might not have many productions. A hand-crafted parser for such a format may well be context-sensitive, or do "clever things", particularly at the token level or with whitespace. As Doaitse Swierstra noted, such clever things can quickly lead to correctness problems.

On Mon, Jul 26, 2010 at 03:01:54 +0000, Jason Dagit wrote:
I think the grammar is fairly simple, although I'm not confident classifying it. I know it can be parsed with just a simple pass over the data. The only uses of backtracking are just to figure out what is next, like a peek at the next token. Let me give you some samples of what the input looks like.
For the interested, I think you can view http://darcs.net/src/Darcs/Patch/Read.hs or better yet darcs get --lazy http://darcs.net
Here are three entries from the "inventory"; they correspond to PatchInfos:

[TAG 2.4
Reinier Lamers
**20100226180900 Ignore-this: 36ce0456c214345f55a7bc5fc142e985 ]
If it turns out to be a sufficiently low-powered grammar, we should probably write it up formally and stick it in the source code as documentation.

Eric

PS. We've been making little bits of progress trying to document Darcs at a high technical level, e.g.

- http://wiki.darcs.net/DarcsInternals/Record
- http://wiki.darcs.net/DarcsInternals/CacheSystem

Such a grammar would be a nice addition to the good-enough-that-you-could-rewrite-Darcs-in-Fortran aspiration.

--
Eric Kow http://www.nltg.brighton.ac.uk/home/Eric.Kow
For a faster response, please try +44 (0)1273 64 2905.
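[Editor's note: as a starting point for the formal write-up Eric proposes, the single entry shown in this thread suggests a shape like the sketch below. Every field name and the PatchInfo layout here are guesses inferred from that one sample, not Darcs's actual grammar; in particular the author email line appears to have been stripped by the archive and is omitted.]

```haskell
-- A guessed parser for one inventory entry, inferred from the sample.
module Main where

import Text.Parsec
import Text.Parsec.String (Parser)

data PatchInfo = PatchInfo
  { piName   :: String  -- e.g. "TAG 2.4"
  , piAuthor :: String  -- e.g. "Reinier Lamers"
  , piDate   :: String  -- e.g. "20100226180900"
  , piIgnore :: String  -- the Ignore-this hash
  } deriving Show

line :: Parser String
line = many1 (noneOf "\n") <* newline

patchInfo :: Parser PatchInfo
patchInfo = do
  _      <- char '['
  name   <- line
  author <- line
  _      <- string "**"
  date   <- many1 digit <* newline
  hash   <- string "Ignore-this: " >> many1 (noneOf " \n")
  _      <- spaces >> char ']'
  return (PatchInfo name author date hash)

sample :: String
sample = "[TAG 2.4\nReinier Lamers\n**20100226180900\n\
         \Ignore-this: 36ce0456c214345f55a7bc5fc142e985 ]"

main :: IO ()
main = print (parse patchInfo "" sample)
```

Even a guessed grammar like this gives round-trip tests something concrete to generate against.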

I took a quick look at this file. To me it seems a mixture of a lexer and a parser built on top of a home-brewed parser library. I see functions like maybeWork which (if I interpret correctly) test whether specific conditions hold for the input, etc.

Indeed it would be nice to have a grammatical description of the input format. An important question is whether you can be assured that all input is indeed correct, or whether any checking has to be done.

Doaitse

On Mon, Jul 26, 2010 at 4:14 AM, S. Doaitse Swierstra wrote:
I took a quick look at this file. To me it seems a mixture of a lexer and a parser built on top of a home-brewed parser library. I see functions like maybeWork which (if I interpret correctly) test whether specific conditions hold for the input, etc.

The one Eric linked to is the current parser, but it's not the one I had in mind when I mentioned the "parsec-like" API. I have modified the API exported by the home-brewed parser to be more parsec-like, but my changes are still in review. You can see them here:

http://hpaste.org/fastcgi/hpaste.fcgi/view?id=28240#a28240

I have also included the PatchInfo parser on that page, as it is defined in a different module than the one Eric linked.

Indeed it would be nice to have a grammatical description of the input format. An important question is whether you can be assured that all input is indeed correct, or whether any checking has to be done.

Most of the time darcs just consumes its own output. Sometimes humans edit their patches, for example to rewrite history, but this is discouraged. Also, darcs stores hashes of files and checks them, so editing patches will fail unless those hashes are updated too. When patches are mailed, the mailer might munge them, so that's another time when it's good to do some input validation.

Thanks,
Jason
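[Editor's note: the input-validation concern above, mailers munging patches, is what a fuzz-style totality check covers: a well-behaved parser should return a clean parse error on mangled input, never throw. The sketch below hand-rolls a pseudo-random generator so it needs only base and parsec; in practice QuickCheck with an Arbitrary instance does this job, and the entry parser is a hypothetical stand-in.]

```haskell
-- Fuzz-style check: the parser is total on arbitrary printable input.
module Main where

import Control.Exception (SomeException, evaluate, try)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical stand-in for the Darcs entry parser.
entry :: Parser String
entry = string "[TAG " >> many1 (noneOf "\n")

-- Tiny linear-congruential generator, for reproducible "mangled" input.
randomChars :: Int -> [Char]
randomChars seed = map (toEnum . (+ 32) . (`mod` 95)) (iterate step seed)
  where step x = (1103515245 * x + 12345) `mod` 2147483648

mangled :: Int -> String
mangled seed = take 40 (randomChars seed)

-- True iff parsing returned a result (Left or Right) without throwing.
totalOn :: String -> IO Bool
totalOn s = do
  r <- try (evaluate (parse entry "" s))
         :: IO (Either SomeException (Either ParseError String))
  return (either (const False) (const True) r)

main :: IO ()
main = do
  ok <- mapM (totalOn . mangled) [1 .. 100]
  print (and ok)
```

This only checks "no crash"; combined with the round-trip property on valid input it covers both directions Jason mentions.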
participants (4)
- Eric Kow
- Jason Dagit
- S. Doaitse Swierstra
- Stephen Tetley