attoparsec and backtracking

I have a couple of problems with attoparsec which I think are related to its always backtrack nature, but maybe there's some other way to solve the same problems. The first is that it's hard to get the right error msg out. For instance, I have a parser that tries to parse a number with an optional type suffix. It's an error if the suffix is unrecognized: p_num :: A.Parser Score.TypedVal p_num = do num <- p_untyped_num suffix <- A.option "" ((:"") <$> A.letter_ascii) case Score.code_to_type suffix of Nothing -> fail $ "p_num expected suffix in [cdsr]: " ++ show suffix Just typ -> return $ Score.Typed typ num However, which error msg shows up depends on the order of the (<|>) alternatives, and in general the global structure of the entire parser, because I think it just backtracks and then picks the last failing backtrack. Even after carefully rearranging all the parsers it seems impossible to get this particular error to bubble up to the top. The thing is, as soon as I see an unexpected suffix I know I can fail entirely right there, with exactly that error msg, but since there's no way to turn off backtracking I think there's no way to do that. The second thing is that I want to lex a single token. I thought I could just parse a term and then see how far I got and then use that index to splitAt the input. But attoparsec doesn't keep track of the current index, so I wrote ((,) <$> p_term <*> A.takeByteString), and then I can use the length of the left over bytestring. But due to backtracking, that changes what p_term will parse. Since takeByteString always succeeds, it will cause p_term to backtrack until it finds some prefix that will match. The result is that instead of failing to parse, "1." will lex as ("1", "."). Since I integrate lexing and parsing (as is common with combinator parsers), and since it seems I can't keep track of byte position with attoparsec, I think I'm out of luck trying to do this the easy way. I think I have to write a separate lexer that tries to have the same behaviour as the parser, but is all separate code. I know attoparsec was designed for speed, and error reporting and source positions are secondary. Am I simply asking too much of it? I originally used parsec, but parsing speed is my main bottleneck, so I don't want to give ground there. Is there a clever way to get attoparsec to do what I want? Or a ByteString or Text parser out there which is fast, but can not backtrack, or keep track of input position? I've heard some good things about traditional alex+happy... of course it would mean a complete rewrite but might be interesting. Has anyone compared the performance of attoparsec vs. alex+happy?

Evan Laforge wrote:
The first is that it's hard to get the right error msg out. For instance, I have a parser that tries to parse a number with an optional type suffix. It's an error if the suffix is unrecognized:
p_num :: A.Parser Score.TypedVal p_num = do num <- p_untyped_num suffix <- A.option "" ((:"") <$> A.letter_ascii) case Score.code_to_type suffix of Nothing -> fail $ "p_num expected suffix in [cdsr]: " ++ show suffix Just typ -> return $ Score.Typed typ num
I think the mistake here is to parse something and then decide if its it valid. It should be the parser which decides whether its valid. So rather than: suffix <- A.option "" ((:"") <$> A.letter_ascii) try: typ <- A.choice [ {- list or valid suffix parsers -} ] return $ Score.Typed typ num
However, which error msg shows up depends on the order of the (<|>) alternatives, and in general the global structure of the entire parser, because I think it just backtracks and then picks the last failing backtrack.
I'm not sure if what I've offered will help, but its worth a try.
Even after carefully rearranging all the parsers it seems impossible to get this particular error to bubble up to the top.
Yes, I've found it impossible to force attoparsec to fail a parse. I think that is intended as a feature.
The thing is, as soon as I see an unexpected suffix I know I can fail entirely right there, with exactly that error msg, but since there's no way to turn off backtracking I think there's no way to do that.
Yes, that's my impression. <snip>
I originally used parsec, but parsing speed is my main bottleneck, so I don't want to give ground there.
We you using Parsec as a token parser or as a Char parser. Obviously the second is going to be slow in comparison to the first.
I've heard some good things about traditional alex+happy... of course it would mean a complete rewrite but might be interesting.
I've user Alex with both Parsec and Happy, where speed was strong secondary goal. Personally I much prefer Parsec; IMO its easier to debug and extend and modify that Happy based parsers. I also know some people prefer Happy.
Has anyone compared the performance of attoparsec vs. alex+happy?
I haven't, nor have I compared those two with alex+parsec. It would be an interesting experiment. Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/

On 16 March 2013 12:54, Erik de Castro Lopo
Evan Laforge wrote:
The first is that it's hard to get the right error msg out. For instance, I have a parser that tries to parse a number with an optional type suffix. It's an error if the suffix is unrecognized:
p_num :: A.Parser Score.TypedVal p_num = do num <- p_untyped_num suffix <- A.option "" ((:"") <$> A.letter_ascii) case Score.code_to_type suffix of Nothing -> fail $ "p_num expected suffix in [cdsr]: " ++ show suffix Just typ -> return $ Score.Typed typ num
I think the mistake here is to parse something and then decide if its it valid. It should be the parser which decides whether its valid. So rather than:
suffix <- A.option "" ((:"") <$> A.letter_ascii)
try:
typ <- A.choice [ {- list or valid suffix parsers -} ] return $ Score.Typed typ num
However, which error msg shows up depends on the order of the (<|>) alternatives, and in general the global structure of the entire parser, because I think it just backtracks and then picks the last failing backtrack.
I'm not sure if what I've offered will help, but its worth a try.
Even after carefully rearranging all the parsers it seems impossible to get this particular error to bubble up to the top.
Yes, I've found it impossible to force attoparsec to fail a parse. I think that is intended as a feature.
I don't know about a "feature", but I tried adding polyparse-style commit semantics to attoparsec and couldn't do so without making it rather noticeably slower.
The thing is, as soon as I see an unexpected suffix I know I can fail entirely right there, with exactly that error msg, but since there's no way to turn off backtracking I think there's no way to do that.
Yes, that's my impression.
<snip>
I originally used parsec, but parsing speed is my main bottleneck, so I don't want to give ground there.
We you using Parsec as a token parser or as a Char parser. Obviously the second is going to be slow in comparison to the first.
I've heard some good things about traditional alex+happy... of course it would mean a complete rewrite but might be interesting.
I've user Alex with both Parsec and Happy, where speed was strong secondary goal. Personally I much prefer Parsec; IMO its easier to debug and extend and modify that Happy based parsers. I also know some people prefer Happy.
Has anyone compared the performance of attoparsec vs. alex+happy?
I haven't, nor have I compared those two with alex+parsec. It would be an interesting experiment.
Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

Is it not possible to add an alternative (no pun intended) to <|> that supports the semantics Evan wants? I would agree that what attoparsec does for <|> of Alternative and mplus for MonadPlus is correct since e.g. the mplus laws say that a failure must be identity and therefore the following alternatives must be considered. I also find it very convenient that attoparsec works this way, and prefer it to what parsec does by default. However, I do not see why attoparsec cannot have a function <||> that on failure with consumed input does not evaluate the remaining alternatives. On 16/03/13 01:54, Erik de Castro Lopo wrote:
Evan Laforge wrote:
However, which error msg shows up depends on the order of the (<|>) alternatives, and in general the global structure of the entire parser, because I think it just backtracks and then picks the last failing backtrack.
I'm not sure if what I've offered will help, but its worth a try.

* Niklas Hambüchen
I would agree that what attoparsec does for <|> of Alternative and mplus for MonadPlus is correct since e.g. the mplus laws say that a failure must be identity and therefore the following alternatives must be considered. I also find it very convenient that attoparsec works this way, and prefer it to what parsec does by default.
empty/mzero are indeed identities in Parsec. What doesn't hold is the law v >> mzero = mzero But this one is often violated:
flip runState 0 $ runMaybeT mzero (Nothing,0)
flip runState 0 $ runMaybeT $ lift (modify (+1)) >> mzero (Nothing,1)
Roman

On Fri, Mar 15, 2013 at 8:49 PM, Niklas Hambüchen
Is it not possible to add an alternative (no pun intended) to <|> that supports the semantics Evan wants?
I assume it's the performance thing. Presumably it would need to pass an extra flag with to the failure continuation to tell it to not retry, though that doesn't sound so bad. Actually, Bryan's response here: https://github.com/bos/attoparsec/issues/42 makes it sound like he's not opposed to <||> on performance grounds, just that <|> is more intuitive. I agree in general, but not in the case of error msgs! So maybe I just need to see if I can make a patch to add <||>, sounds like Johan at least would be into that.

@Evan Thanks for that link, I posted a somewhat longer argument in there. Personally, I'd love <||>. On Sun 17 Mar 2013 02:02:44 GMT, Evan Laforge wrote:
On Fri, Mar 15, 2013 at 8:49 PM, Niklas Hambüchen
wrote: Is it not possible to add an alternative (no pun intended) to <|> that supports the semantics Evan wants?
I assume it's the performance thing. Presumably it would need to pass an extra flag with to the failure continuation to tell it to not retry, though that doesn't sound so bad. Actually, Bryan's response here:
https://github.com/bos/attoparsec/issues/42
makes it sound like he's not opposed to <||> on performance grounds, just that <|> is more intuitive. I agree in general, but not in the case of error msgs! So maybe I just need to see if I can make a patch to add <||>, sounds like Johan at least would be into that.

I think the mistake here is to parse something and then decide if its it valid. It should be the parser which decides whether its valid. So rather than:
suffix <- A.option "" ((:"") <$> A.letter_ascii)
try:
typ <- A.choice [ {- list or valid suffix parsers -} ] return $ Score.Typed typ num
I actually had that originally, but but switched to fail-after for the better error msg. It worked with parsec, but then I lost it again when I switched to attoparsec. I think Wren is right, I really would need to refactor the parser to put the decisions in the right spot.
We you using Parsec as a token parser or as a Char parser. Obviously the second is going to be slow in comparison to the first.
It was actually the Text version of parsec, back when that was new. I should go do a profile again someday, but since attoparsec and parsec APIs are almost but not quite the same, it's kind of a pain. I actually tried the Text version of attoparsec, back when that was not yet integrated into attoparsec itself, and bytestring was still significantly faster. So I don't know how much was Text vs. ByteString and how much was parsec vs. attoparsec.
participants (5)
-
Erik de Castro Lopo
-
Evan Laforge
-
Ivan Lazar Miljenovic
-
Niklas Hambüchen
-
Roman Cheplyaka