
Hi!

I'm having a really hard time writing a correct parser for a small language I've developed. I have been trying to write a parser using parsec, but I always get a lot of error messages like "unexpected "\n", expected ..., new-line or..." when trying to run the parser. Then I read about the happy parser and really liked the separation of lexing the text into tokens and parsing the actual logic behind those tokens. Since I couldn't get familiar with the lexer "alex", I gave up on the alex/happy approach and went back to parsec. But with that lexer-then-parser idea on my mind, my parser currently looks a lot like a lexer. So I came up with the idea of using a combination of parsec and happy, where I generate a list of tokens for my text via parsec and analyse it with happy.

My questions would be:

- Is this a valid approach?
- What is your workflow for parsing complex data structures?
- What about performance? Since my project is going to be an interpreted language, parsing performance might be interesting as well. I've read that happy is in general faster than parsec, but what if I combine both of them as I said above? I guess that parsing a simple list of tokens without any nested parser structures would be pretty fast?
- Do you have any other ideas on how to improve my parser?
- What are your general thoughts on happy vs. parsec?

Thanks for any replies,
Nils

I don't know if you've already used it, but Parsec includes some kind of a lexer through the Language (http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse...) and Token (http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse...) modules. You can start by having a look at the makeTokenParser (http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse...) function.
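For illustration, a minimal sketch of the usual pattern (the reserved names below are invented for the example, not from the Parsec docs):

    import Text.Parsec
    import Text.Parsec.String (Parser)
    import qualified Text.Parsec.Token as P
    import Text.Parsec.Language (emptyDef)

    -- Build a token parser from a language definition; emptyDef is a
    -- blank slate, extended here with two made-up reserved words.
    lexer :: P.TokenParser ()
    lexer = P.makeTokenParser emptyDef { P.reservedNames = ["let", "in"] }

    -- The resulting record provides lexeme parsers that already skip
    -- trailing white space and comments:
    identifier :: Parser String
    identifier = P.identifier lexer

    integer :: Parser Integer
    integer = P.integer lexer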
On 31 October 2010 15:11, Nils Schweinsberg wrote:
Hi! I'm having a really hard time writing a correct parser for a small language I've developed. [snip]
-- Ozgur Akgun

On 31.10.2010 16:50, Ozgur Akgun wrote:
I don't know if you've already used it, but Parsec includes some kind of a lexer through the Language and Token modules. You can start by having a look at the makeTokenParser function. [snip]

Yeah, I've read about the TokenParser, but since my language is not a typical programming language its use is very limited for me. My language is basically a combination of a scripting language and a markup language like, for example, markdown. And parsing that scripting language isn't the difficult part so far...

2010/10/31 Nils Schweinsberg
Hi!
I'm having a really hard time writing a correct parser for a small language I've developed. I have been trying to write a parser using parsec, but I always get a lot of error messages like "unexpected "\n", expected ..., new-line or..." when trying to run the parser. [snip]
Hi,

I can't really tell from your description, but maybe this is because of the way Parsec works when it deals with alternatives. When you combine several parsers with e.g. '<|>' or 'choice', an alternative that can consume some input but fails will make the whole combined parser fail too. So you have to either factor your parsers or use 'try'. See the documentation for 'try' at http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse... .

HTH,
Thu
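A tiny example of the failure mode described above (the parsers are invented for illustration):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- On the input "function", string "for" consumes the 'f' before
    -- failing at 'u', so the combined parser fails without ever trying
    -- string "function":
    bad :: Parser String
    bad = string "for" <|> string "function"

    -- Wrapped in try, the first alternative fails without consuming
    -- input, so the second alternative gets its turn:
    good :: Parser String
    good = try (string "for") <|> string "function"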

On 31.10.2010 16:53, Vo Minh Thu wrote:
I can't really tell from your description, but maybe this is because of the way Parsec works when it deals with alternatives. When you combine several parsers with e.g. '<|>' or 'choice', an alternative that can consume some input but fails will make the whole combined parser fail too. So you have to either factor your parsers or use 'try'. See the documentation for 'try' at http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse... .
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...

On 31 October 2010 16:15, Nils Schweinsberg wrote:
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
I always thought this was an obvious decision: when combining two parsers `a` and `b` (i.e. a <|> b), if parser `a` can only fail without consuming any input, do not wrap it in a try; if it can fail after consuming input, do. Am I missing something?

Ozgur
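That rule in code, with invented grammar fragments for illustration:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- char fails without consuming anything when the first character
    -- does not match, so by the rule above no try is needed:
    sign :: Parser Char
    sign = char '+' <|> char '-'

    -- string "letrec" can fail *after* consuming "let" (e.g. on the
    -- input "let x"), so by the rule above it needs a try:
    keyword :: Parser String
    keyword = try (string "letrec") <|> string "let"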

On 31 Oct 2010, at 16:15, Nils Schweinsberg wrote:
On 31.10.2010 16:53, Vo Minh Thu wrote:
So you have to either factor your parsers or use 'try'.
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
Have you considered using a different set of parser combinators, a set that is actually composable, and does not require the mysterious "try"? I would recommend polyparse (because I wrote it), but uuparse would also be a fine choice.

Regards,
Malcolm

Malcolm Wallace wrote:
Nils Schweinsberg wrote:
Vo Minh Thu wrote:
So you have to either factor your parsers or use 'try'.
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
Have you considered using a different set of parser combinators, a set that is actually composable, and does not require the mysterious "try"? I would recommend polyparse (because I wrote it), but uuparse would also be a fine choice.
I second that, the semantics of Parsec are quite tricky. It's worth learning them properly if you insist on using Parsec. If you don't want to do that, it's easier to use a different library.

The two defining properties of Parsec's alternative <|> combinator are:

1) If a parser p consumes a character, then it will take precedence over its alternatives:

       p <|> q = p    -- whenever p consumes input

2) Parsers that are in sequence with an alternative don't have any influence on which alternative is chosen:

       (p <|> q) >>= k = p >>= k    -- if p succeeds
                       = q >>= k    -- if p does not succeed

   Note that p >>= k might not succeed even if p does succeed and q >>= k would succeed.

I found the following rules of thumb helpful:

* Try to use alternatives p <|> q only when p and q recognize different first characters.
* Don't do recursion yourself, use premade combinators like many1 etc. instead.
* lookAhead is very useful for avoiding capturing input in the second argument of manyTill.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

Nils Schweinsberg wrote:
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
In my experience, try blocks should only be used at the very innermost parts of the parser. That is, avoid wrapping large combinator functions in try blocks.

Erik
--
Erik de Castro Lopo
http://www.mega-nerd.com/

On 31/10/2010 04:15 PM, Nils Schweinsberg wrote:
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
I vaguely recall somebody mentioning a parser library on Hackage where "try" is the default behaviour and you turn it off explicitly, rather than turning it on explicitly. Apparently this is much more intuitive. But unfortunately I can't remember what the hell the library was...

On 1 November 2010 22:18, Andrew Coppin wrote:
I vaguely recall somebody mentioning a parser library on Hackage where "try" is the default behaviour and you turn it off explicitly, rather than turning it on explicitly. Apparently this is much more intuitive. But unfortunately I can't remember what the hell the library was...
polyparse, is it? from http://code.haskell.org/~malcolm/polyparse/docs/index.html#what
If you have only ever used the parsec combinators before, then you might be pleasantly surprised by polyparse: all of the same functionality is available, but it removes the confusion that all too commonly arises from a failure to use parsec's try combinator correctly. Ambiguous grammars often fail to be compositional in parsec, and it can be a black art guessing where to introduce a try to fix it. In contrast, polyparsers are by default fully compositional. It is possible to improve their efficiency (and the accuracy of error messages) by inserting commits (which are the dual of try), but it is not necessary for writing a correct parser, and furthermore, it is usually obvious where it can be beneficial to add a commit.

Ozgur

On Oct 31, 2010, at 17:15, Nils Schweinsberg wrote:
On 31.10.2010 16:53, Vo Minh Thu wrote:
I can't really tell from your description, but maybe this is because of the way Parsec works when it deals with alternatives. When you combine several parsers with e.g. '<|>' or 'choice', an alternative that can consume some input but fails will make the whole combined parser fail too. So you have to either factor your parsers or use 'try'. See the documentation for 'try' at http://hackage.haskell.org/packages/archive/parsec/3.1.0/doc/html/Text-Parse... .
This is exactly what gives me headaches. It's hard to tell where you need try/lookAhead and where you don't need them. And I don't really feel comfortable wrapping everything into try blocks...
This is precisely why you should use a more general parser library like uu-parsinglib and try to avoid the more low-level techniques used in Parsec; uu-parsinglib avoids all the confusion arising from the use of try constructs. It furthermore gives you an online result, error correction and nice error messages. Its Utils module contains a lot of useful "standard" elements you might want to recognise. Thus far I have only happy users, and if you are having any problems please let me know.

Doaitse

PS: for all parsing libraries it holds that parsing times are negligible when compared to the time spent on what you want to do with the parsed result.

* S D Swierstra
Thus far I have only happy users, and if you are having any problems please let me know.
Hi Doaitse,

One thing that I'm curious about is how you avoid combinatorial explosion when parsing in a breadth-first fashion. Do you somehow manage to merge the branches? Or just hope that in practice it won't be too bad?

--
Roman I. Cheplyaka :: http://ro-che.info/

There is an amb combinator which you can use to lift an ambiguous parser into one which returns a list of results. This will take care that the part after the ambiguous non-terminal is only parsed once. Like most libraries I do indeed not cache results of applying parsers at specific positions, besides the use of amb. This is more than most other libraries do.

Besides this there are various greedy versions of combinators which take care of repeating constructs; so there is a pChainL and a pChainL_ng etc. This has a bit the same effect as the try combinator, but in a more structured way. Of course applying some left-factoring is always a good idea; but I think this should only be done once real problems arise, which is not very often.

In case these solutions are not sufficient one could of course take one of the more advanced libraries, which perform a real grammar analysis, such as a left-corner transform. There are two libraries supporting this: the grammar-combinators package of Dominique deVriese and the ChristmasTree package from Utrecht, which is mainly aimed at solving problems with the exponential complexity in GHC's implementation of the read function for values of data types with infix constructors. The first one uses many recent GHC extensions. Both are less of a combinator-parsing style, since support for freely abstracting over common grammatical patterns is somewhat more problematic. Both can use uu-parsinglib as a back end.

Doaitse

PS: previous implementations of GHC made some assumptions about the kind of programs people would write. Unfortunately some parsers using the older uulib package were thus spending 99.5% of their time in garbage collection. This has been cured, thanks to work by Simon Marlow. Some parsers run over 50 times faster!!

On Jan 21, 2012, at 9:47, Roman Cheplyaka wrote:
One thing that I'm curious about is how you avoid combinatorial explosion when parsing in a breadth-first fashion. [snip]

If you use the Language and Token modules, Parsec gives you something close to a lexer/parser separation _but_ you can drop down to character-level parsers if you want to - this is very handy. There are some caveats though - for instance, the number parsers from the Token module follow Haskell's lexical syntax, but you can override them if necessary.

You can also write separate parsers; this is covered in the (PDF) Parsec manual available from Daan Leijen's old home page. However, I usually avoid this as it seems rather cumbersome.

Happy/Alex seems generally much faster than Parsec; I would imagine the combination of Happy and Parsec to inherit the speed of Parsec.

Personally I prefer Parsec. uu-parsing is also nice, but it didn't have the equivalent of Parsec's Token parsers, so it was always a bit less convenient. However, if I'm starting from an LR grammar I'll use Happy/Alex.

On 31 October 2010 15:55, Stephen Tetley wrote:
You can also write separate parsers this is covered in the (pdf) Parsec manual available from Daan Leijen's old home page, however I usually avoid this as it seems rather cumbersome.
D'oh. I meant "separate _scanners_" of course; see chapter 2.11, "Advanced: Separate scanners".

- Is this a valid approach?
It is possible that your Parsec lexer will need to see the entire input before it delivers any tokens at all to the Happy parser. This might cause a space problem, depending on how large your inputs are likely to be.
- What is your workflow for parsing complex data structures?
I usually write a lexer by hand, as a little state machine delivering a lazy list of tokens, then pass the tokens as the input to a grammar built from parser combinators. For larger inputs, I use lazy parser combinators to avoid space leaks.
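A minimal sketch of that style (the token type and the toy lexical syntax are invented for illustration):

    import Data.Char (isAlpha, isDigit, isSpace)

    data Token = TIdent String | TNum Integer | TSym Char
      deriving (Eq, Show)

    -- A little state machine over the input string. Because the result
    -- is built with (:) as tokens are found, the list is produced lazily,
    -- so a parser consuming it can start before the whole input is read.
    lexTokens :: String -> [Token]
    lexTokens [] = []
    lexTokens input@(c:cs)
      | isSpace c = lexTokens cs
      | isDigit c = let (digits, rest) = span isDigit input
                    in  TNum (read digits) : lexTokens rest
      | isAlpha c = let (word, rest) = span isAlpha input
                    in  TIdent word : lexTokens rest
      | otherwise = TSym c : lexTokens cs

For example, lexTokens "x = 42" yields [TIdent "x", TSym '=', TNum 42].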
- What about performance? Since my project is going to be an interpreted language, parsing performance might be interesting as well. I've read that happy is in general faster than parsec, but what if I combine both of them as I said above? I guess that parsing a simple list of tokens without any nested parser structures would be pretty fast?
Parser combinators are rarely a performance bottleneck in my experience. However, relying on parser combinators to do lexing often slows things down too much (too much back-tracking).

Regards,
Malcolm
participants (10)
- Andrew Coppin
- Erik de Castro Lopo
- Heinrich Apfelmus
- Malcolm Wallace
- Nils Schweinsberg
- Ozgur Akgun
- Roman Cheplyaka
- S D Swierstra
- Stephen Tetley
- Vo Minh Thu