Re: [Haskell-cafe] Techniques for ensuring parser correctness?

26 Jul 2010

      On Mon, Jul 26, 2010 at 12:03 AM, S. Doaitse Swierstra <
doaitse@swierstra.net> wrote:
...
On 26 jul 2010, at 03:51, Jason Dagit wrote:
Hello,
I find that parser correctness is often hard to verify.  Therefore, I'm
interested in techniques that others have used successfully, especially with
Haskell.
It seems to me that you are not so much trying to verify parsers, but more
specifically Parsec parsers. Since in Parsec-based parsers you control the
backtracking explicitly such parsers can get very complicated semantics. Now
the question arises: what does it mean for a (Parsec) parser to be correct?
Do you have another description of the  language which is to be recognised,
e.g. a context-free grammar. Only then can you give meaning to the word
"correctness".
In general I think that the more your parser combinators deviate from
context-free grammars in terms of expressiveness, the more problems you will
encounter. If you make heavy use of the monadic part, you will not only have
to prove the correctness of static parsers, but even of parsers which are
generated dynamically. If you use the backtrack-controlling features, your
proofs will become even more complicated, since it is unlikely that your
more abstract formalism in which you have specified your language does not
have a similar construct: here comes in about 50 years on research on
parsing techniques and grammar analysis. If your grammar is e.g. LL(1) then
you can verify that some of the back-tracking-controlling features in your
Parser parser have been used in a sound way, i.e., that you will be able to
parse any sentence that your grammar describes.
If you have a context-free grammar, and you want to be relatively sure that
the parser is correct and you do not want to go through large verification
efforts I suggest you use the uu-parsinglib; the only restriction there is
is that your grammar should fulfill certain modest "well-formedness"
criteria, such as being non-left-recursive and non-ambiguous. Then  the
semantics of the combinators are exactly what you want, i.e. your parsers
and your grammars are isomorphic. If you have however an "incorrect formal
specification", i.e., a specification which contains ambiguous non-terminals
like p* constructs where p can reduce to an empty string   things break. The
first problem one is not recognised and will lead to a non-terminating
parser, whereas the second problem is detected by the grammars analysing
themselves while being used, and leading to a run-time error message once
you reach that part of the grammar during parsing.
Thanks for the reply.

I think the grammar is fairly simple, although I'm not confident classifying
it.  I know it can be parsed with just a simple pass over the data.  The
only uses of backtracking are just to figure out what is next, like a peek
at the next token.  Let me give you some samples of what the input looks
like.

Here are three entries from the "inventory" they correspond to PatchInfos:
[TAG 2.4
Reinier Lamers **20100226180900
 Ignore-this: 36ce0456c214345f55a7bc5fc142e985
]
hash:
0000000560-c6bb2c4334a557826cb1a662a8d57ccb9a78390833fab2f1d65e190939f283a3
[Make record store patch metadata in UTF-8
Reinier Lamers **20090917165301
 Ignore-this: 6640e121987d6a76479e46d9cc14413b
]
hash:
0000008496-b0170277eee44adc98f553bfbdadae1fb440cb3aaa4988ea19fbcad9d65e31b0
[Add test for UTF-8 metadata
Reinier Lamers **20090917165327
 Ignore-this: 3e81237e8af61a45d64ac60269e1fe90 UTF-8
]
hash:
0000004693-d258a7f56c4ed067d219b540ca6b0ce2e2d66bb5fa9e86799a17504f6ebfce38

The brackets delimit the PatchInfos.  The first line is the short
description, or name, of the patch.  The next line, up to the first *, is
the author.  The second * could also be a -, these are followed by the
date/timestamp.  All the lines between the date and closing bracket must
start with a space that gets dropped by the parser.  These lines constitute
the long patch description.  I think the lines here start with a leading
space so that brackets appearing in the description do not need to be
escaped.  Most patches have no long description so his part could be empty,
although modern darcs inserts headers here in ever patch, such as the
"Ignore-this:" field.

The line "hash: ...", gives the file name of the patch that corresponds to
the PatchInfo immediately before it.  The hash line is optional because the
initial versions of darcs did not have this feature.  The parser for
inventories reads as many PatchInfos as it finds.  I think there is always
at least one, but I'm not certain of that.  The corner cases would be newly
created repository, or immediately after tagging a repository.  Inventories
are split at tags and new repositories have no patches.

The format of patches themselves is similarly linear.  Here is an example,
taking the top several lines of the second patch listed above:
[Make record store patch metadata in UTF-8
Reinier Lamers **20090917165301
 Ignore-this: 6640e121987d6a76479e46d9cc14413b
] hunk ./src/ByteStringUtils.hs 49
-        intercalate
+        intercalate,
+        getArgsLocale,
+        decodeLocale,
+        encodeLocale,
+        encodeLatin1,
+        decodeString,
+        utf8ToLocale
hunk ./src/ByteStringUtils.hs 70
+import System.Environment       ( getArgs )
+import System.Console.Haskeline ( runInputT, defaultSettings )
+import System.Console.Haskeline.Encoding ( decode, encode )
hunk ./src/ByteStringUtils.hs 103
+import Codec.Binary.UTF8.Generic ( toString )
+
hunk ./src/ByteStringUtils.hs 500
+
+-- | A locale-aware version of getArgs

Notice that the PatchInfo is repeated, followed by a line describing which
type of patch it is.  In this case they are all "hunk" patches.  It could be
other things though, like "addfile", "adddir", "rmdir", and a few others.
 That first line varies between the patch types, but it always specifies the
patch type, usually mentions a directory or filename, and maybe some other
information.  In the case of hunk patches, it mentions which line of the
file to start modifying at.  The lines starting with - or + are lines to be
removed or added.  As you can see from the second hunk, "hunk
./src/ByteStringUtils.hs 70", that either set of lines can be omitted.  In
this case, there are no lines to be removed, so no lines starting with -.
 The lines starting with +, when present, always come after the lines
starting with -.  This input continues this way until all the patches are
enumerated.  That is, there is no special marker at the end, this parser
simple reads to the end of input returning a list of the patches read.

The parser for patches uses 'try' to figure out if the next token is
"addfile", "hunk", and so on but otherwise it just does a straight pass over
the data without any significant backtracking.

I strongly suspect that as far as grammars are concerned, this one is very
simple.  The trick is to parse it correctly, robustly, and efficiently while
accounting for backwards compatibility.  The format has evolved slightly
over the years.

Now that you've seen some examples, what would you recommend?

Jason