An irritating Parsec problem

I like Parsec. I use it for everything. But it does have one irritating problem. Consider the following parser:

    expressions = many1 expression

Suppose this is the top-level parser for my language. Now suppose the user supplies an expression with a syntax error halfway through it. What I *want* to happen is for an error to be raised. What *actually* happens is that Parsec just ignores all input after that point. So if "+" is not a valid token, but the user writes

    x = 1; y = 2; z = 3 + z; w = 4;

then what my program receives back is "x = 1; y = 2; z = 3", as if everything parsed successfully. But actually it has ignored half the input! o_O

Does anybody know how to fix this irritating quirk? I can see why it happens, but not how to fix it.

On Wed, 2008-10-15 at 20:22 +0100, Andrew Coppin wrote:
> I like Parsec. I use it for everything. But it does have one irritating problem.
> Consider the following parser:
>   expressions = many1 expression
> Suppose this is the top-level parser for my language.
I always wrap my top-level parsers in

    return const `ap` parser `ap` eof

to express that they have to match the entire input. (This is a bit easier if you supply the missing Applicative instance:

    const <$> parser <*> eof

). I think Parsec should either do this itself or tell you what the unconsumed input tokens were, but it doesn't.

jcc
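A minimal sketch of the wrapper described above, assuming parsec 3 (`Text.Parsec`), which ships the Applicative instance. The names `wholeInput` and `numbers` are hypothetical and only serve the illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper: run p, then insist that no input is left over.
wholeInput :: Parser a -> Parser a
wholeInput p = const <$> p <*> eof   -- equivalent to: p <* eof

-- Toy parser: one or more whitespace-separated non-negative integers.
numbers :: Parser [Int]
numbers = many1 (read <$> many1 digit <* spaces)

-- The first input is consumed completely; the second has trailing junk.
accepted, rejected :: Either ParseError [Int]
accepted = parse (wholeInput numbers) "" "1 2 3"
rejected = parse (wholeInput numbers) "" "1 2 x"
```

With this sketch `accepted` is a `Right` value and `rejected` is a `Left` carrying a parse error, whereas `parse numbers "" "1 2 x"` on its own would succeed and drop the `x`.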

On Wed, 15 Oct 2008, Andrew Coppin wrote:
> Suppose this is the top-level parser for my language.
<snip>
> Does anybody know how to fix this irritating quirk? I can see why it happens, but not how to fix it.
One of:

    expressions = many1 (try expression <|> myFail)
      where myFail = {- eat your way to the next expression -}

or do a prepass splitting your input up into expressions and feed the individual expressions into Parsec.

Parsec's not designed to do error recovery as such, so it's something you need to work out how to handle if you need it.

-- flippa@flippac.org
'In Ankh-Morpork even the shit have a street to itself... Truly this is a land of opportunity.' - Detritus, Men at Arms
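One possible reading of the `myFail` idea, as a hedged sketch rather than a recommended design: when a statement fails, skip ahead to the next `;` and record what was skipped. It assumes parsec 3 and a toy grammar of the form `name = digits;`; the names `statement`, `skipStatement` and `statements` are hypothetical.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical toy statement: "name = digits;"
statement :: Parser (String, Integer)
statement = do
  name <- many1 letter <* spaces
  _ <- char '=' <* spaces
  value <- many1 digit <* spaces
  _ <- char ';' <* spaces
  return (name, read value)

-- Recovery step: discard everything up to and including the next ';'.
skipStatement :: Parser String
skipStatement = manyTill anyChar (char ';') <* spaces

-- Left holds skipped text, Right holds a successfully parsed statement.
statements :: Parser [Either String (String, Integer)]
statements =
  many1 (try (Right <$> statement) <|> (Left <$> skipStatement)) <* eof

recovered :: Either ParseError [Either String (String, Integer)]
recovered = parse statements "" "x = 1; y = 2; z = 3 + z; w = 4;"
```

Here the bad statement shows up as a `Left` entry between the good ones, so the caller can report it and still use the rest. Junk with no closing `;` still makes the whole parse fail.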

On Wed, 15 Oct 2008, Andrew Coppin wrote:
> Suppose this is the top-level parser for my language. Now suppose the user supplies an expression with a syntax error halfway through it. What I *want* to happen is for an error to be raised. What *actually* happens is that Parsec just ignores all input after that point. So if "+" is not a valid token, but the user writes
>   x = 1; y = 2; z = 3 + z; w = 4;
> then what my program receives back is "x = 1; y = 2; z = 3"
That'll teach me not to scan-read when I'm tired!

    expressions = do es <- many1 expression
                     eof
                     return es

-- flippa@flippac.org
Society does not owe people jobs. Society owes it to itself to find people jobs.
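To see the difference `eof` makes, here is a small self-contained sketch. It assumes parsec 3 and a toy grammar in which the `;` after each binding is optional, which is one way to reproduce the behaviour from the original post. The names `binding`, `lenient`, `strict` and `input` are hypothetical.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical toy grammar: "name = digits" with an optional trailing ';'.
binding :: Parser (String, Integer)
binding = do
  name <- many1 letter <* spaces
  _ <- char '=' <* spaces
  value <- many1 digit <* spaces
  optional (char ';' <* spaces)
  return (name, read value)

-- Stops quietly at the first thing that does not start a binding.
lenient :: Parser [(String, Integer)]
lenient = many1 binding

-- Same parser, but leftover input is now a parse error.
strict :: Parser [(String, Integer)]
strict = do
  es <- many1 binding
  eof
  return es

input :: String
input = "x = 1; y = 2; z = 3 + z; w = 4;"
```

`parse lenient "" input` succeeds with only the bindings for `x`, `y` and `z`, while `parse strict "" input` returns a `Left` whose error position points at the `+`.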

On Wed, 15 Oct 2008, Andrew Coppin wrote:
> Philippa Cowderoy wrote:
>> expressions = do es <- many1 expression
>>                  eof
>>                  return es
> Ah - so "eof" fails if it isn't the end of the input?
    eof = notFollowedBy anyChar

(assuming I've got the identifiers right, that's the actual definition too)

-- flippa@flippac.org
Society does not owe people jobs. Society owes it to itself to find people jobs.
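A hand-rolled version of the definition above, for `Char` streams only, assuming parsec 3. As far as I know the library's own `eof` uses `anyToken` rather than `anyChar`, so that it works for any token type, and attaches the label "end of input". The name `endOfInput` is hypothetical.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Succeeds only when no character is left; never consumes input.
endOfInput :: Parser ()
endOfInput = notFollowedBy anyChar <?> "end of input"

atEnd, notAtEnd :: Either ParseError ()
atEnd    = parse endOfInput "" ""
notAtEnd = parse endOfInput "" "x"
```

`atEnd` is a `Right` value and `notAtEnd` is a `Left`, which is exactly the "fails if it isn't the end of the input" behaviour asked about.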

Philippa Cowderoy wrote:
> On Wed, 15 Oct 2008, Andrew Coppin wrote:
>> Philippa Cowderoy wrote:
>>> expressions = do es <- many1 expression
>>>                  eof
>>>                  return es
>> Ah - so "eof" fails if it isn't the end of the input?
> eof = notFollowedBy anyChar
> (assuming I've got the identifiers right, that's the actual definition too)
OK, well that'll make it fail alright. Now I just gotta figure out how to get a sane error message out of it! ;-) (The example I showed is very simple; real parsers generally aren't.)

Here's what I have in one file:

    -- | Parse the text of an event with the given parser @p@.
    parse :: (Monad m) => P.CharParser () a -> String -> Derive.DeriveT m a
    parse p text = do
        (val, rest) <- case P.parse (p_rest p) "" text of
            Left err -> Derive.throw $
                "parse error on char " ++ show (P.sourceColumn (P.errorPos err))
                ++ " of " ++ show text ++ ": "
                ++ Seq.replace "\n" "; " (show_error_msgs (Parsec.Error.errorMessages err))
            Right val -> return val
        unless (null rest) $
            Derive.warn $ "trailing junk: " ++ show rest
        return val

    -- Contrary to its documentation, showErrorMessages takes a set of strings
    -- for translation, which makes it hard to use.
    show_error_msgs = Parsec.Error.showErrorMessages
        "or" "unknown parse error" "expecting" "unexpected" "end of input"

    p_rest :: P.GenParser tok st t -> P.GenParser tok st (t, [tok])
    p_rest p = do
        val <- p
        rest <- P.getInput
        return (val, rest)

And this reminds me of something I was going to ask about: it would be nice to fix either the documentation for showErrorMessages or the implementation. Preferably the implementation, because I can't see the current implementation actually being useful for translation...

Andrew Coppin wrote:
> Philippa Cowderoy wrote:
>> On Wed, 15 Oct 2008, Andrew Coppin wrote:
>>> Philippa Cowderoy wrote:
>>>> expressions = do es <- many1 expression
>>>>                  eof
>>>>                  return es
>>> Ah - so "eof" fails if it isn't the end of the input?
>> eof = notFollowedBy anyChar
>> (assuming I've got the identifiers right, that's the actual definition too)
> OK, well that'll make it fail alright. Now I just gotta figure out how to get a sane error message out of it! ;-)
> (The example I showed is very simple; real parsers generally aren't.)
Actually, I added this to my real parser, and it actually seems to do exactly what I want. Give it an invalid expression and it immediately pinpoints exactly where the problem is, why it's a problem, and what you should be doing instead. Neat!

On Thu, 16 Oct 2008, Andrew Coppin wrote:
> Actually, I added this to my real parser, and it actually seems to do exactly what I want. Give it an invalid expression and it immediately pinpoints exactly where the problem is, why it's a problem, and what you should be doing instead. Neat!
Yep. There're some wrinkles (normally involving negation in the grammar), but by and large it gets it right - and doubly so if you sprinkle <?>s around.

-- flippa@flippac.org
"I think you mean Philippa. I believe Phillipa is the one from an alternate universe, who has a beard and programs in BASIC, using only gotos for control flow." -- Anton van Straaten on Lambda the Ultimate
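A short sketch of what labelling sub-parsers with Parsec's `<?>` operator can look like, assuming parsec 3. The grammar and the names `identifier`, `number`, `assignment` and `expected` are hypothetical; `expected` just pulls the "expecting ..." labels out of a failed parse so they can be inspected.

```haskell
import Text.Parsec
import Text.Parsec.Error (Message (..), errorMessages)
import Text.Parsec.String (Parser)

-- Each building block gets a human-readable label for error messages.
identifier :: Parser String
identifier = (many1 letter <* spaces) <?> "identifier"

number :: Parser Integer
number = (read <$> many1 digit <* spaces) <?> "number"

assignment :: Parser (String, Integer)
assignment = do
  name <- identifier
  _ <- char '=' <* spaces
  value <- number
  return (name, value)

-- Collects the "expecting ..." labels from a failed parse, if any.
expected :: String -> [String]
expected input = case parse (assignment <* eof) "" input of
  Left err -> [s | Expect s <- errorMessages err]
  Right _  -> []
```

For an input such as `"x = ?"` the list returned by `expected` includes `"number"` instead of the lower-level `"digit"`, which tends to read better in a message shown to a user.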

On Wed, Oct 15, 2008 at 2:22 PM, Andrew Coppin wrote:
> So if "+" is not a valid token, but the user writes
>   x = 1; y = 2; z = 3 + z; w = 4;
> then what my program receives back is "x = 1; y = 2; z = 3"
You said you expect one or more 'expression'. It looks as if your expression can optionally be terminated by semicolon? Can you demand semicolons at the ends of your expressions? Then, "z = 3" would not constitute a complete expression and an error would be raised.
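A sketch of the "demand semicolons" suggestion using Parsec's `endBy1` combinator, assuming parsec 3 and a toy `name = digits` expression; the grammar and the parser bodies are hypothetical.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical toy expression: "name = digits", without the terminator.
expression :: Parser (String, Integer)
expression = do
  name <- many1 letter <* spaces
  _ <- char '=' <* spaces
  value <- many1 digit <* spaces
  return (name, read value)

-- Every expression must be followed by ';'.
expressions :: Parser [(String, Integer)]
expressions = expression `endBy1` (char ';' <* spaces)

good, bad :: Either ParseError [(String, Integer)]
good = parse expressions "" "x = 1; y = 2;"
bad  = parse expressions "" "x = 1; y = 2; z = 3 + z; w = 4;"
```

Here `bad` is a `Left`, because `z = 3` has been consumed and the required `;` is missing. Note that junk appearing where a new expression should start is still silently ignored without a final `eof`, so the two suggestions complement each other.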

Participants (6): Andrew Coppin, brian, Bryan O'Sullivan, Evan Laforge, Jonathan Cast, Philippa Cowderoy