Parsec - separating Parsing from Lexing

Hello. I'm currently implementing a MicroJava compiler for a college assignment (the implemented language was defined, the implementation language was of free choice). I've sucessfully implemented the lexer using Parsec. It has the type String -> Parser [MJVal], where MJVal are all the possible tokens. However, I don't know how to implement the parser, or at least how to do it keeping it distinguished from the lexer. For example, the MicroJava grammar specifies: Program = "program" ident {ConstDecl | VarDecl | ClassDecl} "{" {MethodDecl} "}". The natural solution (for me) would be: program = do string "program" programName <- identifier ... However, I can't do this because the file is already tokenized, what I have is something like: [Program_, identifier_ "testProgram", lBrace_, ...] for the example program: program testProgram { ... How should I implement the parser separated from the lexer? That is, how should I parse Tokens instead of Strings in the "Haskell way"? Fernando Henrique Sanches

I've sucessfully implemented the lexer using Parsec. It has the type String -> Parser [MJVal], where MJVal are all the possible tokens.
Great! You're partway there.
How should I implement the parser separated from the lexer? That is, how should I parse Tokens instead of Strings in the "Haskell way"?
AFAIK, the difference is between having an input stream of type [Char] vs. having a stream of tokens, e.g. [MJVal]. I haven't used Parsec myself, but perhaps you want something of type 'GenParser MJVal s r' for some state s and return type r? There's an excellent set of lecture notes from a class we have at Utrecht, formerly called Grammars and Parsing and now Languages and Compilers, in which we use parser combinators in Haskell . http://people.cs.uu.nl/johanj/publications/MAIN.pdf It's not Parsec, but there's plenty of useful general information in there, cf. 4.5.1, 4.5.2. Regards, Sean

You have to bootstrap your parser with something that takes an `MJVal` and updates the parser state. Here is a simple example: http://github.com/jsnx/system-uuid/blob/master/Options.hs This is a parser for command line options. It parses a list of `String`s, not `Char`s (because `argv` is `char**` and not `char*`, right?) so we introduce `stringPrim` and then build up the primitives from that. -- Jason Dusek

Hello again.
First, I'd like to thank you both for your help. You gave me nice tips and
nice material.
However, I think there is something I'm missing here. I've coded most of the
parser, and I have a feeling I'm doing some great mistake. Not only my code
won't compile, I don't know how to deal with multiple "possible return
values" and worse yet - I don't know how I'll proceed from this to Semantic
Analysis and, later, code generation. Honestly, I'm thinking I'll have to
scrap out all of this code and use the (ugh) Java template code the teacher
gave us.
My code is divided in three files:
Parser.hs (the real parsing, problematic):
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=13399#a13399
ParserTokens.hs (where I believe the problem arouse):
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=13400#a13400
Lexer.hs (the lexer, the only part I believe is mostly right):
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=13401#a13401
That's a lot of code and I feel I'm far from reaching the solution to my
problem. This code is getting ugly, and I'm somewhat desperate. Any help
will be greatly appreciated.
Fernando Henrique Sanches
On Tue, Nov 10, 2009 at 6:21 PM, Jason Dusek
You have to bootstrap your parser with something that takes an `MJVal` and updates the parser state. Here is a simple example:
http://github.com/jsnx/system-uuid/blob/master/Options.hs
This is a parser for command line options. It parses a list of `String`s, not `Char`s (because `argv` is `char**` and not `char*`, right?) so we introduce `stringPrim` and then build up the primitives from that.
-- Jason Dusek

Hi Fernando Which version of Parsec are you using, the one that ships with GHC or Parsec 3? I would imagine whichever you one you are using you have the imports wrong for these modules: import Text.Parsec.Pos import Text.Parsec.Prim should be ... import Text.ParserCombinators.Parsec.Pos import Text.ParserCombinators.Parsec.Prim Also it seems like an applicative instance for Parsec's GenParser monad is missing. This isn't defined for parsec-2.1.0.1, people seem to define it themselves: -- The Applicative instance for every Monad looks like this. instance Applicative (GenParser s a) where pure = return (<*>) = ap ... you will need `ap` from Control.Monad in scope. Also there is a pdf document for Parsec (should be available from Daan Leijen's home page) that covers separate scanners, it has quite a lot more documentation than Parsec's Haddock docs. For semantic analysis I'd highly recommend UUAG, it is well documented and used for a large compiler (EHC). There is also a version of Andrew Appel's Tiger language written with it that is much smaller and more comprehensible, the version on Hackage doesn't seem to contain the attribute grammar source but it is available here: http://www.cs.uu.nl/wiki/bin/view/HUT/TigerCompiler http://www.cs.uu.nl/wiki/bin/view/HUT/AttributeGrammarSystem Best wishes Stephen

Hi.
I believe I'm using parsec 3.0.1, although I already fixed the imports (even
thou they were working).
Thanks for the references, specially the parsec User Guide - it saved my
life, in 30 minutes I actually *understood* my mistake and fixed it. Now my
code and it compiles and makes sense, even though it's a bit useless, since
it has no return values [yet].
I'll take a look at UUAG and Tiger. However, since this is a simple college
assignment, I want my compiler to be as simple as possible but in a way I
understand what I'm doing, so I'll probably be doing it by hand.
Fernando Henrique Sanches
On Mon, Nov 30, 2009 at 2:58 PM, Stephen Tetley
Hi Fernando
Which version of Parsec are you using, the one that ships with GHC or Parsec 3?
I would imagine whichever you one you are using you have the imports wrong for these modules:
import Text.Parsec.Pos import Text.Parsec.Prim
should be ...
import Text.ParserCombinators.Parsec.Pos import Text.ParserCombinators.Parsec.Prim
Also it seems like an applicative instance for Parsec's GenParser monad is missing. This isn't defined for parsec-2.1.0.1, people seem to define it themselves:
-- The Applicative instance for every Monad looks like this. instance Applicative (GenParser s a) where pure = return (<*>) = ap
... you will need `ap` from Control.Monad in scope.
Also there is a pdf document for Parsec (should be available from Daan Leijen's home page) that covers separate scanners, it has quite a lot more documentation than Parsec's Haddock docs.
For semantic analysis I'd highly recommend UUAG, it is well documented and used for a large compiler (EHC). There is also a version of Andrew Appel's Tiger language written with it that is much smaller and more comprehensible, the version on Hackage doesn't seem to contain the attribute grammar source but it is available here:
http://www.cs.uu.nl/wiki/bin/view/HUT/TigerCompiler
http://www.cs.uu.nl/wiki/bin/view/HUT/AttributeGrammarSystem
Best wishes
Stephen

Hi Fernando. I tried this approach for a toy language as well, and I was unhappy with it. I have found that, with Parsec, it is best to *not* split your parsing completely into "tokenization" and "parsing" phases, but rather to interleave them. Instead of
tokenize :: Parser [MJVal]
make
token :: Parser MJVal
Then use something like the following:
tokenSatisfies :: (MJVal -> Bool) -> Parser MJVal tokenSatisfies f = try $ do t <- token if (f t) then return t else fail "No parse"
program :: Parser Program program = do tokenSatisfies (== Program_) programName <- identifier -- etc. return $ Program i ...
There is a bit of an inefficiency using "try"; you'll reparse the same
token multiple times for each failure branch of a "choice" branch, but
I've found this to be the simplest solution and parsing time rarely
dominates your running time.
-- ryan
On Tue, Nov 10, 2009 at 11:23 AM, Fernando Henrique Sanches
Hello. I'm currently implementing a MicroJava compiler for a college assignment (the implemented language was defined, the implementation language was of free choice). I've sucessfully implemented the lexer using Parsec. It has the type String -> Parser [MJVal], where MJVal are all the possible tokens. However, I don't know how to implement the parser, or at least how to do it keeping it distinguished from the lexer. For example, the MicroJava grammar specifies: Program = "program" ident {ConstDecl | VarDecl | ClassDecl} "{" {MethodDecl} "}". The natural solution (for me) would be: program = do string "program" programName <- identifier ... However, I can't do this because the file is already tokenized, what I have is something like: [Program_, identifier_ "testProgram", lBrace_, ...] for the example program: program testProgram { ... How should I implement the parser separated from the lexer? That is, how should I parse Tokens instead of Strings in the "Haskell way"? Fernando Henrique Sanches
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (5)
-
Fernando Henrique Sanches
-
Jason Dusek
-
Ryan Ingram
-
Sean Leather
-
Stephen Tetley