
Hi!

I'm having a really hard time writing a correct parser for a small language I've developed. I have been trying to write a parser using parsec, but I always get a lot of error messages like "unexpected "\n", expected ..., new-line or..." when trying to run the parser. Then I read about the happy parser and really liked the separation of lexing the text into tokens and parsing the actual logic behind those tokens. Since I couldn't get familiar with the lexer "alex", I gave up on the alex/happy approach and went back to parsec. But with that lexer-then-parser idea on my mind, my parser currently looks a lot like a lexer. So I came up with the idea of using a combination of parsec and happy, where I generate a list of tokens for my text via parsec and analyse it with happy.

My questions would be:

- Is this a valid approach?
- What is your workflow for parsing complex data structures?
- What about performance? Since my project is going to be an interpreted language, parsing performance might be interesting as well. I've read that happy is in general faster than parsec, but what if I combine both of them as I said above? I guess that parsing a simple list of tokens without any nested parser structures would be pretty fast?
- Do you have any other ideas on how to improve my parser?
- What are your general thoughts on happy vs. parsec?

Thanks for any replies,
Nils

I don't know if you've already used it, but Parsec includes some kind of a lexer through the Language (http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse...) and Token (http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse...) modules. You can start by having a look at the makeTokenParser (http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse...) function.
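For illustration, a minimal sketch of the usual pattern (the reserved names below are invented for the example, not from the Parsec docs):

    import Text.Parsec
    import Text.Parsec.String (Parser)
    import qualified Text.Parsec.Token as P
    import Text.Parsec.Language (emptyDef)

    -- Build a token parser from a language definition; emptyDef is a
    -- blank slate, extended here with two made-up reserved words.
    lexer :: P.TokenParser ()
    lexer = P.makeTokenParser emptyDef { P.reservedNames = ["let", "in"] }

    -- The resulting record provides lexeme parsers that already skip
    -- trailing white space and comments:
    identifier :: Parser String
    identifier = P.identifier lexer

    integer :: Parser Integer
    integer = P.integer lexer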
On 31 October 2010 15:11, Nils Schweinsberg wrote:
Hi! I'm having a really hard time writing a correct parser for a small language I've developed. [snip]
-- Ozgur Akgun

On 31.10.2010 16:50, Ozgur Akgun wrote:
I don't know if you've already used it, but Parsec includes some kind of a lexer through the Language and Token modules. You can start by having a look at the makeTokenParser function. [snip]

Yeah, I've read about the TokenParser, but since my language is not a typical programming language its use is very limited for me. My language is basically a combination of a scripting language and a markup language like, for example, markdown. And parsing that scripting language isn't the difficult part so far...

2010/10/31 Nils Schweinsberg
Hi!
I'm having a really hard time writing a correct parser for a small language I've developed. I have been trying to write a parser using parsec, but I always get a lot of error messages like "unexpected "\n", expected ..., new-line or..." when trying to run the parser. [snip]
Hi,

I can't really tell from your description, but maybe this is because of the way Parsec works when it deals with alternatives. When you combine several parsers with e.g. '<|>' or 'choice', an alternative that can consume some input but fails will make the whole combined parser fail too. So you have to either factor your parsers or use 'try'. See the documentation for 'try' at http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse... .

HTH,
Thu
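A tiny example of the failure mode described above (the parsers are invented for illustration):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- On the input "function", string "for" consumes the 'f' before
    -- failing at 'u', so the combined parser fails without ever trying
    -- string "function":
    bad :: Parser String
    bad = string "for" <|> string "function"

    -- Wrapped in try, the first alternative fails without consuming
    -- input, so the second alternative gets its turn:
    good :: Parser String
    good = try (string "for") <|> string "function"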

On 31.10.2010 16:53, Vo Minh Thu wrote:
I can't really tell from your description, but maybe this is because of the way Parsec works when it deals with alternatives. When you combine several parsers with e.g. '<|>' or 'choice', an alternative that can consume some input but fails will make the whole combined parser fail too. So you have to either factor your parsers or use 'try'. See the documentation for 'try' at http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse... .
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...

On 31 October 2010 16:15, Nils Schweinsberg wrote:
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
I always thought this was an obvious decision: when combining two parsers `a` and `b` (i.e. a <|> b), if parser `a` can only fail without consuming any input, do not wrap it in a try; if it can fail after consuming input, do. Am I missing something?

Ozgur
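That rule in code, with invented grammar fragments for illustration:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- char fails without consuming anything when the first character
    -- does not match, so by the rule above no try is needed:
    sign :: Parser Char
    sign = char '+' <|> char '-'

    -- string "letrec" can fail *after* consuming "let" (e.g. on the
    -- input "let x"), so by the rule above it needs a try:
    keyword :: Parser String
    keyword = try (string "letrec") <|> string "let"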

On 31 Oct 2010, at 16:15, Nils Schweinsberg wrote:
On 31.10.2010 16:53, Vo Minh Thu wrote:
So you have to either factor your parsers or use 'try'.
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
Have you considered using a different set of parser combinators, a set that is actually composable, and does not require the mysterious "try"? I would recommend polyparse (because I wrote it), but uuparse would also be a fine choice.

Regards,
Malcolm

Malcolm Wallace wrote:
Nils Schweinsberg wrote:
Vo Minh Thu wrote:
So you have to either factor your parsers or use 'try'.
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
Have you considered using a different set of parser combinators, a set that is actually composable, and does not require the mysterious "try"? I would recommend polyparse (because I wrote it), but uuparse would also be a fine choice.
I second that, the semantics of Parsec are quite tricky. It's worth learning them properly if you insist on using Parsec. If you don't want to do that, it's easier to use a different library.

The two defining properties of Parsec's alternative <|> combinator are:

1) If a parser p consumes a character, then it will take precedence over its alternatives:

       p <|> q = p    -- whenever p consumes input

2) Parsers that are in sequence with an alternative don't have any influence on which alternative is chosen:

       (p <|> q) >>= k = p >>= k    -- if p succeeds
                       = q >>= k    -- if p does not succeed

   Note that p >>= k might not succeed even if p does succeed and q >>= k would succeed.

I found the following rules of thumb helpful:

* Try to use alternatives p <|> q only when p and q recognize different first characters.
* Don't do recursion yourself, use premade combinators like many1 etc. instead.
* lookAhead is very useful for avoiding capturing input in the second argument of manyTill.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

Nils Schweinsberg wrote:
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
In my experience, try blocks should only be used at the very innermost parts of the parser. That is, avoid wrapping large combinator functions in try blocks.

Erik
--
Erik de Castro Lopo
http://www.mega-nerd.com/

On 31/10/2010 04:15 PM, Nils Schweinsberg wrote:
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
I vaguely recall somebody mentioning a parser library on Hackage where "try" is the default behaviour and you turn it off explicitly, rather than turning it on explicitly. Apparently this is much more intuitive. But unfortunately I can't remember what the hell the library was...

On 1 November 2010 22:18, Andrew Coppin wrote:
I vaguely recall somebody mentioning a parser library on Hackage where "try" is the default behaviour and you turn it off explicitly, rather than turning it on explicitly. Apparently this is much more intuitive. But unfortunately I can't remember what the hell the library was...
polyparse, is it? from http://code.haskell.org/~malcolm/polyparse/docs/index.html#what
If you have only ever used the parsec combinators before, then you might be pleasantly surprised by polyparse: all of the same functionality is available, but it removes the confusion that all too commonly arises from a failure to use parsec's try combinator correctly. Ambiguous grammars often fail to be compositional in parsec, and it can be a black art guessing where to introduce a try to fix it. In contrast, polyparsers are by default fully compositional. It is possible to improve their efficiency (and the accuracy of error messages) by inserting commits (which are the dual of try), but it is not necessary for writing a correct parser, and furthermore, it is usually obvious where it can be beneficial to add a commit.

Ozgur

On Oct 31, 2010, at 17:15, Nils Schweinsberg wrote:
On 31.10.2010 16:53, Vo Minh Thu wrote:
I can't really tell from your description, but maybe this is because of the way Parsec works when it deals with alternatives. When you combine several parsers with e.g. '<|>' or 'choice', an alternative that can consume some input but fails will make the whole combined parser fail too. So you have to either factor your parsers or use 'try'. See the documentation for 'try' at http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse... .
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
This is precisely why you should use a more general parser library like uu-parsinglib and try to avoid the more low-level techniques used in Parsec; uu-parsinglib avoids all the confusion arising from the use of try constructs. It furthermore gives you an online result, error correction and nice error messages. Its Utils module contains a lot of useful "standard" elements you might want to recognise. Thus far I have only happy users, and if you are having any problems please let me know.

Doaitse

PS: for all parsing libraries it holds that parsing times are negligible when compared to the time spent on what you want to do with the parsed result.

* S D Swierstra
Thus far I have only happy users, and if you are having any problems please let me know.
Hi Doaitse,

One thing that I'm curious about is how you avoid combinatorial explosion when parsing in a breadth-first fashion. Do you somehow manage to merge the branches? Or just hope that in practice it won't be too bad?

--
Roman I. Cheplyaka :: http://ro-che.info/

There is an amb combinator which you can use to lift an ambiguous parser into one which returns a list of results. This will take care that the part after the ambiguous non-terminal is only parsed once. Like most libraries I do indeed not cache results of applying parsers at specific positions, besides the use of amb. This is more than most other libraries do.

Besides this there are various greedy versions of combinators which take care of repeating constructs; so there is a pChainL and a pChainL_ng etc. This has a bit the same effect as the try combinator, but in a more structured way. Of course applying some left-factoring is always a good idea; but I think this should only be done once real problems arise, which is not very often.

In case these solutions are not sufficient one could of course take one of the more advanced libraries, which perform a real grammar analysis, such as a left-corner transform. There are two libraries supporting this: the grammar-combinators package of Dominique deVriese and the ChristmasTree package from Utrecht, which is mainly aimed at solving problems with the exponential complexity in GHC's implementation of the read function for values of data types with infix constructors. The first one uses many recent GHC extensions. Both are less of a combinator-parsing style, since support for freely abstracting over common grammatical patterns is somewhat more problematic. Both can use uu-parsinglib as a back end.

Doaitse

PS: previous implementations of GHC made some assumptions about the kind of programs people would write. Unfortunately some parsers using the older uulib package were thus spending 99.5% of their time in garbage collection. This has been cured, thanks to work by Simon Marlow. Some parsers run over 50 times faster!!

On Jan 21, 2012, at 9:47, Roman Cheplyaka wrote:
One thing that I'm curious about is how you avoid combinatorial explosion when parsing in a breadth-first fashion. [snip]

If you use the Language and Token modules, Parsec gives you something close to a lexer/parser separation _but_ you can drop down to character-level parsers if you want to - this is very handy. There are some caveats though - for instance, the number parsers from the Token module follow Haskell's lexical syntax, but you can override them if necessary.

You can also write separate parsers; this is covered in the (PDF) Parsec manual available from Daan Leijen's old home page. However, I usually avoid this as it seems rather cumbersome.

Happy/Alex seems generally much faster than Parsec; I would imagine the combination of Happy and Parsec to inherit the speed of Parsec.

Personally I prefer Parsec. uu-parsing is also nice, but it didn't have the equivalent of Parsec's Token parsers, so it was always a bit less convenient. However, if I'm starting from an LR grammar I'll use Happy/Alex.

On 31 October 2010 15:55, Stephen Tetley wrote:
You can also write separate parsers this is covered in the (pdf) Parsec manual available from Daan Leijen's old home page, however I usually avoid this as it seems rather cumbersome.
D'oh. I meant "separate _scanners_" of course; see chapter 2.11, "Advanced: Separate scanners".

- Is this a valid approach?
It is possible that your Parsec lexer will need to see the entire input before it delivers any tokens at all to the Happy parser. This might cause a space problem, depending on how large your inputs are likely to be.
- What is your workflow for parsing complex data structures?
I usually write a lexer by hand, as a little state machine delivering a lazy list of tokens, then pass the tokens as the input to a grammar built from parser combinators. For larger inputs, I use lazy parser combinators to avoid space leaks.
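A minimal sketch of that style (the token type and the toy lexical syntax are invented for illustration):

    import Data.Char (isAlpha, isDigit, isSpace)

    data Token = TIdent String | TNum Integer | TSym Char
      deriving (Eq, Show)

    -- A little state machine over the input string. Because the result
    -- is built with (:) as tokens are found, the list is produced lazily,
    -- so a parser consuming it can start before the whole input is read.
    lexTokens :: String -> [Token]
    lexTokens [] = []
    lexTokens input@(c:cs)
      | isSpace c = lexTokens cs
      | isDigit c = let (digits, rest) = span isDigit input
                    in  TNum (read digits) : lexTokens rest
      | isAlpha c = let (word, rest) = span isAlpha input
                    in  TIdent word : lexTokens rest
      | otherwise = TSym c : lexTokens cs

For example, lexTokens "x = 42" yields [TIdent "x", TSym '=', TNum 42].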
- What about performance? Since my project is going to be an interpreted language, parsing performance might be interesting as well. I've read that happy is in general faster than parsec, but what if I combine both of them as I said above? I guess that parsing a simple list of tokens without any nested parser structures would be pretty fast?
Parser combinators are rarely a performance bottleneck in my experience. However, relying on parser combinators to do lexing often slows things down too much (too much back-tracking).

Regards,
Malcolm
participants (10)
- Andrew Coppin
- Erik de Castro Lopo
- Heinrich Apfelmus
- Malcolm Wallace
- Nils Schweinsberg
- Ozgur Akgun
- Roman Cheplyaka
- S D Swierstra
- Stephen Tetley
- Vo Minh Thu