Token parsers in parsec consume trailing whitespace

I recently ran into a nasty little bug: token parsers in Parsec consume trailing whitespace, so they were eating my newlines and thus bamboozling a higher-level 'sepBy' combinator. I replaced my uses of 'natural' with 'read <$> many1 digit', but this gave rise to the following questions:

1. Is there a more elegant way of doing number parsing? In particular, are there token parsers that don't consume trailing whitespace, or is there a better way to do this with the primitives?

2. It seems that the "token" approach to parsing lends itself to a different style than the one I'm using: instead of assuming all of your parsers consume exactly what they need and no more, you assume that they consume what they need plus any trailing spaces. Thus, code that looks like:

    do foo <- fooParser
       spaces
       bar <- barParser
       spaces
       baz <- bazParser
       return $ FooBarBaz foo bar baz

becomes:

    FooBarBaz <$> fooParser <*> barParser <*> bazParser

and instead of using 'sepBy' you just use 'many'. One problem I see with this approach: if I was using 'sepBy newline', the new token-oriented parser has no way of distinguishing "foo bar baz\nfoo bar baz" from "foo bar baz foo bar baz", which is something I might care about. Which method do you prefer?

3. Not so much a question as a comment: when parsing entire files, be sure to add the 'eof' combinator at the end!

Cheers,
Edward
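P.S. In case a concrete repro helps, here's roughly the shape of the code that bit me. This is a sketch using the stock 'haskellDef' lexer; the 'numbers' parser is an illustrative name, not my real code:

    import Text.Parsec
    import Text.Parsec.String (Parser)
    import qualified Text.Parsec.Token as P
    import Text.Parsec.Language (haskellDef)

    lexer :: P.TokenParser ()
    lexer = P.makeTokenParser haskellDef

    natural :: Parser Integer
    natural = P.natural lexer

    -- 'natural' swallows the newline after each number, so by the time
    -- 'sepBy' looks for its separator it is already gone: on "12\n34"
    -- this returns Right [12] and silently leaves "34" unparsed.
    numbers :: Parser [Integer]
    numbers = natural `sepBy` newline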

Hi Edward,
> 1. Is there a more elegant way of doing number parsing? In particular, are there token parsers that don't consume trailing whitespace, or is there a better way to do this with the primitives?
Parsec defines a combinator called 'lexeme', which the tokenizer wraps around each of its functions. The purpose of the tokenizer is to provide a set of parsing combinators that ignore whitespace and comments, plus some other handy stuff like checking for collisions with reserved keywords. Consuming the trailing whitespace is not a bug, it's an abstraction layer, and Parsec is consistent about using this abstraction only in the Token module. It's too bad that the 'nat' function in Token is not also defined in Parsec's Char module; because of that, you need to copy-paste that code or roll your own.
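Rolling your own is only a couple of lines, though. A minimal sketch (note that the real 'lexeme' in Token also skips comments, which is part of why it lives behind that abstraction):

    import Control.Applicative ((<$>), (<*))
    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- The essence of Token's 'lexeme': run a parser, then discard
    -- any trailing whitespace. (The real one also skips comments.)
    lexeme :: Parser a -> Parser a
    lexeme p = p <* spaces

    -- A 'nat' that consumes exactly the digits and nothing more:
    nat :: Parser Integer
    nat = read <$> many1 digit

    -- The whitespace-eating token version is then just a wrapper:
    natural :: Parser Integer
    natural = lexeme nat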
> It seems that the "token" approach to parsing lends itself to a different style than the one I'm using
That's correct. Sounds to me like you shouldn't bother creating a tokenizer. You might even be able to get away with using the regex library instead of Parsec.

-Greg

Excerpts from Greg Fitzgerald's message of Mon Dec 14 18:44:37 -0500 2009:
> It's too bad that the 'nat' function in Token is not also defined in Parsec's Char module; because of that, you need to copy-paste that code or roll your own.
"Maybe I should write a patch for that."
> That's correct. Sounds to me like you shouldn't bother creating a tokenizer. You might even be able to get away with using the regex library instead of Parsec.
I think that even in a situation where I could use strictly regexes, I would still opt for Parsec. Composability and maintainability, man! Maybe I should add semicolons to the syntax to demarcate records, and then convert everything to token-style parsing.

Edward
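P.S. Roughly what I have in mind, as a hypothetical sketch (the three-natural 'FooBarBaz' record is a stand-in for my real syntax):

    import Control.Applicative ((<$>), (<*>), (*>), (<*))
    import Text.Parsec
    import Text.Parsec.String (Parser)

    data FooBarBaz = FooBarBaz Integer Integer Integer deriving Show

    -- Token style: every parser eats its own trailing whitespace,
    -- newlines included.
    lexeme :: Parser a -> Parser a
    lexeme p = p <* spaces

    symbol :: String -> Parser String
    symbol = lexeme . string

    nat :: Parser Integer
    nat = lexeme (read <$> many1 digit)

    -- An explicit ';' demarcates each record, so newline-vs-space no
    -- longer matters and 'many' replaces 'sepBy newline'.
    record :: Parser FooBarBaz
    record = FooBarBaz <$> nat <*> nat <*> nat <* symbol ";"

    -- parse records "" "1 2 3;\n4 5 6;\n"
    --   ==> Right [FooBarBaz 1 2 3,FooBarBaz 4 5 6]
    records :: Parser [FooBarBaz]
    records = spaces *> many record <* eof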