Re: [Haskell-cafe] attoparsec and backtracking

On 3/15/13 3:29 PM, Evan Laforge wrote:
However, which error msg shows up depends on the order of the (<|>) alternatives, and in general the global structure of the entire parser, because I think it just backtracks and then picks the last failing backtrack. Even after carefully rearranging all the parsers it seems impossible to get this particular error to bubble up to the top. The thing is, as soon as I see an unexpected suffix I know I can fail entirely right there, with exactly that error msg, but since there's no way to turn off backtracking I think there's no way to do that.
I had some similar issues recently. The trick is figuring out how to convince attoparsec to commit to a particular alternative. For example, consider the grammar: A (B A)* C; where if the B succeeds then we want to commit to parsing an A (and if it fails then return A's error, not C's). To simplify things, let's drop the leading A since it's not part of the problem. And let's try to parse an invalid string like "BX" (or "BABX"). The key point is that, bad = (pB *> pure (:) <*> pA <*> bad) <|> (pC *> pure []) is different than, good = do e <- eitherP pB pC -- (Left <$> pB) <|> (Right <$> pC) case e of Left _ -> (:) <$> pA <*> good Right _ -> pure [] In particular, the first one is bad (for our purposes) because due to hoisting the choice up high, after parsing the B we fail to commit, so when parsing A fails we'll backtrack over the B and try C instead. Assuming C doesn't overlap with B, we'll then report C's error. Whereas the latter is good because due to pushing the choice down, once we've parsed B (or C) we're committed to that choice; so when A fails, we'll report A's error (or backtrack to the lowest choice that dominates the call to good). Attoparsec does indeed just report the failure generated by the final parse, so you'll have to refactor things to recognize which sort of token you're looking for (e.g., p_num vs p_identifier or whatever), and then commit to that choice before actually parsing the token. It's not very modular that way, but I think that's the only option right now. It shouldn't be too hard to design a combinator for doing a hard commit (by discarding the backtrack continuation); but that has modularity issues of its own... Another option, of course, is to drop down to performing lexing on the ByteString itself (e.g., [1]) and then wrap those individual lexers to work as attoparsec Parsers. Even if using attoparsec for the heavy lifting, this is a good approach for maximizing performance of the lexing step. [1] http://hackage.haskell.org/package/bytestring-lexing -- Live well, ~wren
participants (1)
-
wren ng thornton