
On Wed, 2005-05-25 at 11:44 +1000, Manuel M T Chakravarty wrote:
> An Alex/Happy parser would be an option if it improves matters significantly. If you or anybody else has a go at it, please follow the C language definition closely, as the existing lexer and parser do (with comments stating to which sections in K&R the individual productions relate).
Interim status update: I've started on the lexer, using Alex. I've converted the existing lexer clause for clause, keeping the K&R language definition comments. The output is exactly the same as the existing lexer's on my test file gtk.i (cpp output of gtk/gtk.h), which is 1014K.

It's a tad faster and runs in minimal space. I timed how long it takes to find the length of the token list (so there's no IO overhead), compiling the lexer modules and all relevant dependent modules with ghc -O. On my old slow 500MHz SPARC, the existing lexer takes 7.8 seconds and needs a 14Mb heap minimum, while the Alex lexer takes 1.8 seconds and runs in less than 1Mb of heap.

One issue I noticed with the existing lexer is that it seems to be strict, i.e. the whole token stream is built before it is returned. The Alex lexer is lazy, which would explain the difference in heap usage. This suggests the existing lexer could be made to perform better if it were made lazy.
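To illustrate the laziness point, here's a minimal sketch in plain Haskell rather than Alex. The Token type and both lexers are hypothetical stand-ins, not c2hs's actual definitions: the point is just that a lexer written as an incrementally built list lets a consumer like `length` run in small constant space, since each token can be garbage-collected as soon as it's consumed.

```haskell
import Data.Char (isAlpha, isSpace)

-- Hypothetical token type, not the one in c2hs.
data Token = TokIdent String | TokPunct Char
  deriving (Show, Eq)

-- Lazy lexer: each cons cell is returned before the rest of the
-- input is examined, so tokens are produced on demand.
lexLazy :: String -> [Token]
lexLazy []     = []
lexLazy (c:cs)
  | isSpace c  = lexLazy cs
  | isAlpha c  = let (word, rest) = span isAlpha (c:cs)
                 in  TokIdent word : lexLazy rest
  | otherwise  = TokPunct c : lexLazy cs

-- A "strict" variant in the sense described above: the spine of
-- the whole token list is forced before anything is returned, so
-- the entire stream is live in the heap at once.
lexStrict :: String -> [Token]
lexStrict s = let ts = lexLazy s in length ts `seq` ts
```

With the lazy version, `length (lexLazy input)` never needs more than one token live at a time, which matches the constant-space behaviour above.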
> Moreover, the module c2hs/base/syntax/ParserMonad.hs already provides a monad suitable for Happy.
I'll take a look. As for the lexer/parser interaction required for C, I guess the way to do it is to have the lexer monad keep the set of typedef'd identifiers and return the appropriate token type to the parser depending on membership of that set. The lexer monad should also expose an operation that lets the parser add identifiers to the typedef'd set after parsing the appropriate kind of declaration.

Duncan
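P.S. Here's a rough sketch of that scheme (the standard "lexer hack" for C). The names and the State-monad representation are assumptions for illustration only, using Control.Monad.State from mtl for brevity; the real code would live in c2hs's own ParserMonad, which will look different. The lexer consults a set of typedef'd names to decide which token constructor to use, and the parser calls addTypedef after reducing a typedef declaration.

```haskell
import Control.Monad.State (State, get, modify, evalState)
import qualified Data.Set as Set

-- Hypothetical token constructors; the real lexer has many more.
data Token = TokIdent String     -- ordinary identifier
           | TokTypeName String  -- identifier known to be typedef'd
  deriving (Show, Eq)

-- Hypothetical lexer/parser monad carrying the typedef'd names.
type P = State (Set.Set String)

-- The lexer uses this to classify an identifier it has just scanned.
classifyIdent :: String -> P Token
classifyIdent name = do
  typedefs <- get
  return $ if name `Set.member` typedefs
             then TokTypeName name
             else TokIdent name

-- The parser calls this after parsing a typedef declaration.
addTypedef :: String -> P ()
addTypedef name = modify (Set.insert name)
```

So `evalState (classifyIdent "GtkWidget") Set.empty` gives `TokIdent "GtkWidget"`, but once the parser has done `addTypedef "GtkWidget"`, the same identifier comes back as `TokTypeName "GtkWidget"`.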