
All,

I posted some time ago about a plan to make a new C parser that would correctly deal with the typedef problem, and then to try some automated testing by making a gcc wrapper. I've now mostly done that.

Currently I can parse almost all of the Linux kernel, glibc, GNU binutils, and the C code produced by ghc when it compiles itself. I can also parse most of the artificial nasty typedef examples:

  typedef int A, B(A);

I think I correctly deal with the scope of typedef names and with typedef names being shadowed by local variable declarations:

  typedef int A;
  A main () {
    A A;        // declares local var A as having type A
    A = A + 1;  // A now refers to the local var, not the type
  }
  A x;          // end of scope, A reverts to referring to the typedef name

A slightly more tricky example of the same thing is where function parameters redeclare names:

  typedef int A;
  A main (A A) {
    A = A + 1;  // A now refers to the local var, not the type
  }

An example I don't deal with yet is:

  typedef int A;
  int A = sizeof(A);

that is, we shadow the typename A with the local var A, so the A in the initialiser should now refer to the local var, not the typename. However, currently the parser only does the shadowing after the ';', not after the '=' as would be required. This is because the production rule looks like:

  declaring_list
    : declaration_specifier declarator initializer_opt
        {% do doDeclIdent $1 $2
              return $ CDecl $1 [($2, $3, Nothing)] }

whereas we'd really like to say:

  declaring_list
    : declaration_specifier declarator {% doDeclIdent $1 $2 } initializer_opt
        { CDecl $1 [($2, $4, Nothing)] }

that is, to modify the typedef set immediately after the declarator and before parsing the initializer. This is exactly how people do it with yacc, but unfortunately happy does not support this form where actions are interspersed in the production. It's probably possible to refactor the grammar to do this with happy, though it's not totally obvious to me how.

Dealing with '__attribute__'s:
==============================

GNU C attributes are a real pain. They seem to be able to appear almost anywhere in a declaration, and adding them to the grammar has to be done very carefully to avoid introducing ambiguities (ie shift/reduce or reduce/reduce conflicts). So for the moment I've taken the approach of not parsing them at all: the lexer recognises attributes, ignores them and does not pass them on to the parser. This effectively treats attributes as whitespace, so they can appear anywhere. This isn't a long-term solution, because eventually we'll need to recognise some attributes: they can affect structure layout, and we need that for accurately calculating the size of types and the offsets of structure members.
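Just to illustrate why attributes can't be treated as whitespace forever, here's a small made-up example (not from the test suite) where __attribute__((packed)) changes the layout gcc produces, so any size/offset calculation that ignores the attribute would get the wrong answers:

  /* On a typical 32/64-bit target gcc reports sizeof 8 and offset 4 for
   * the plain struct, but sizeof 5 and offset 1 for the packed one. */
  #include <stdio.h>
  #include <stddef.h>

  struct plain  { char c; int i; };
  struct packed { char c; int i; } __attribute__((packed));

  int main (void) {
      printf ("plain:  size %zu, offsetof i %zu\n",
              sizeof (struct plain),  offsetof (struct plain, i));
      printf ("packed: size %zu, offsetof i %zu\n",
              sizeof (struct packed), offsetof (struct packed, i));
      return 0;
  }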
C99 extensions:
===============

I'm not sure I've got 100% coverage of C99 features, but I at least cover the ones used in the code I've seen so far. These are just the ones that I can remember adding:

* mixed declarations and statements in a compound statement
* compound literals
* for statements with declarations: for (int n = 0; n < m; n++) {};
* wide character and string literals
* unsigned long long numeric literals
* _Bool and _Complex basic types

GNU extensions:
===============

Again, just some of the GNU extensions that I can remember adding to the new parser, compared to the extensions the previous parser recognised:

* anonymous structs and unions inside other structs/unions
* thread-local storage qualifier
* old-style and array-range compound literal member designators
* case ranges
* computed gotos and address of labels
* conditional operator with missing then part: "x ? : y"
* empty structs
* allow redundant ';' in several places
* asm expressions
* several built-in 'functions' that take type names as parameters

Several of these extensions require corresponding extensions in the AST. (A tiny contrived snippet exercising a few of the C99 and GNU extensions above is in the PS at the end of this message.)

Testing:
========

I've made a cc-wrapper program that you can use as if it were gcc. It calls gcc with the same args as it was called with. Then, if it can figure out what the gcc args meant and it looks like gcc was being asked to compile a .c file, it tries to pre-process and parse the same .c file. It outputs results to a log file and puts detailed reports on parse failures into separate files (in a directory specified via an environment variable). So (in theory at least) one ought to be able to compile vast amounts of C code and find out which extensions are really used and which of them break the parser.

At the moment it doesn't check that it correctly parses the code, only that it parses it at all. Extra checks, as I mentioned before, could include checking whether the sizes of types and offsets of structure members computed by c2hs match those computed by gcc. Another test would be to pretty-print the C code again and see if gcc can still parse it and if it produces the same object code (that'd require fully parsing attributes).

As I said, I've been testing this on the Linux kernel (using allyesconfig), glibc and ghc. Last time I tried it against Linux it failed on 7 out of 4406 .c files. All those failures were due to the same bug, which I think I've now fixed. In glibc, a few files fail to parse due to their use of nested function declarations, a rarely-used GNU extension that I've not implemented yet.

Plan:
=====

So there are a few things to do. One is to continue testing large amounts of C code. I intend to include cc-wrapper in c2hs (or perhaps do it as a c2hs mode via a flag) so that users can do their own testing and provide detailed bug reports.

I haven't checked yet that the new C parser doesn't break things when doing the full .chs -> .hs translations. Also, I've not extended the guts of c2hs to understand any of these new language extensions, so trying to bind things which really use them may not work.

Then I need to start merging the new parser into the darcs version of c2hs. I'll try to do this as a series of small, understandable patches.

If anyone wants to play with the code I've got right now, rather than wait for things to get merged, then do say. I could put up a tarball.

Duncan
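PS: here's the contrived snippet mentioned above, exercising a few of the C99 and GNU extensions in one place. It's just an illustration I made up, not one of the real test cases; it should compile with gcc:

  #include <wchar.h>

  struct point { int x, y; };

  static int classify (int c)
  {
      switch (c) {
      case '0' ... '9':                         /* GNU case range */
          return 1;
      default:
          return 0;
      }
  }

  int main (void)
  {
      _Bool flag = 1;                           /* C99 _Bool */
      unsigned long long big = 1ULL << 40;      /* long long literal */
      const wchar_t *ws = L"wide";              /* wide string literal */
      struct point p = (struct point){ .x = 1, .y = 2 };   /* compound literal */
      int arr[4] = { [0 ... 2] = 7 };           /* GNU array-range designator */

      int total = 0;                            /* mixed declarations and statements */
      for (int n = 0; n < p.y; n++)             /* declaration in a for statement */
          total += n;

      int z = flag ? : (int) big;               /* GNU "x ? : y" */

      void *target = &&out;                     /* GNU address of a label */
      goto *target;                             /* computed goto */
  out:
      return classify ('5') + total + z + arr[3] + (ws[0] != L'w');
  }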

Duncan Coutts:
I posted some time ago about a plan to make a new C parser that would correctly deal with the typedef problem and then to try some automated testing by making a gcc wrapper. I've now mostly done that.
Currently I can parse almost all of the Linux kernel, glibc, GNU binutils, and the C code produced by ghc when it compiles itself. [..] So there are a few things to do, one is to continue testing large amounts of C code. I intend to include cc-wrapper in c2hs (or perhaps do it as a c2hs mode via a flag) so that users can do their own testing and provide a way to give detailed bug reports.
I haven't checked yet that the new C parser doesn't break things when doing the full .chs -> .hs translations. Also, I've not extended the guts of c2hs to understand any of these new language extensions so trying to bind things which really use them may not work.
Then I need to start merging the new parser into the darcs version of c2hs. I'll try and do this as a series of small understandable patches.
Wow, Duncan, you are a hero!! Thanks for all that fantastic work.

Manuel

On Mon, 2007-02-19 at 01:43 +0000, Duncan Coutts wrote:
I posted some time ago about a plan to make a new C parser that would correctly deal with the typedef problem and then to try some automated testing by making a gcc wrapper. I've now mostly done that.
Currently I can parse almost all of the Linux kernel, glibc, GNU binutils, and the C code produced by ghc when it compiles itself. [...] Then I need to start merging the new parser into the darcs version of c2hs. I'll try and do this as a series of small understandable patches.
If anyone wants to play with the code I've got right now rather than wait for things to get merged then do say. I could put up a tarball.

That's very nice! Problems with the current C parser are blocking me from actually using c2hs at the moment (and I failed at my attempts to fix it). If there's anything I could test, please let me know.
Cheers,
Jelmer
--
Jelmer Vernooij

On Mon, 2007-02-19 at 03:23 +0100, Jelmer Vernooij wrote:
On Mon, 2007-02-19 at 01:43 +0000, Duncan Coutts wrote:
I posted some time ago about a plan to make a new C parser that would correctly deal with the typedef problem and then to try some automated testing by making a gcc wrapper. I've now mostly done that.
Currently I can parse almost all of the Linux kernel, glibc, GNU binutils, and the C code produced by ghc when it compiles itself. [...] Then I need to start merging the new parser into the darcs version of c2hs. I'll try and do this as a series of small understandable patches.
If anyone wants to play with the code I've got right now rather than wait for things to get merged then do say. I could put up a tarball.
That's very nice! Problems with the current C parser are blocking me from actually using c2hs at the moment (and I failed at my attempts to fix it). If there's anything I could test, please let me know.
So I've been compiling various packages with this cc-wrapper. I could do with some help on that. Mind you, Gentoo makes this fairly easy to do. If you've got other sources you want it to work with, then now's a good time to try.

Here's the tarball of my current tree:

  http://haskell.org/~duncan/c2hs/c2hs-newparser.tar.gz

(The tarball unpacks to just c2hs/.)

To test:

  $ cd c2hs/c2hs/c/tests
  $ make cc-wrapper
  $ mkdir /tmp/c2hs
  $ export C2HS_CC_LOGDIR=/tmp/c2hs

then invoke cc-wrapper as if it were gcc. Look in $C2HS_CC_LOGDIR for a .log file and error .report files. The .report files are self-contained; that is, they should have all the info necessary for someone else to reproduce the failure (since they contain the full pre-processed source).

An improved cc-wrapper script would be helpful though. I wrote it in Haskell, but there's a problem there: GHC's I/O libs always put all file descriptors into non-blocking mode, and since the blocking/non-blocking status of open file descriptors is shared between processes (silly design imho), this can cause problems. For some compilations (especially when compiling glibc, for some reason) I get an error from cc1 about stdout being "temporarily unavailable". I think the solution is to use a bash (or whatever) script to capture the gcc args, parse them and pass them on to the Haskell C parser prog. Patches welcome :-)

Duncan
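PS: to make the descriptor problem concrete, here's an untested sketch (my own illustration, not part of cc-wrapper) of a tiny C shim that puts stdout/stderr back into blocking mode and then execs gcc. The O_NONBLOCK flag lives in the open file description that child processes inherit, which is why cc1 ends up seeing whatever the GHC runtime set; a bash wrapper as suggested above would avoid the issue just as well.

  /* Hypothetical shim: clear O_NONBLOCK on the inherited stdout/stderr,
   * then hand all the original arguments straight on to gcc. */
  #include <fcntl.h>
  #include <unistd.h>

  static void clear_nonblock (int fd)
  {
      int flags = fcntl (fd, F_GETFL);
      if (flags != -1 && (flags & O_NONBLOCK))
          fcntl (fd, F_SETFL, flags & ~O_NONBLOCK);
  }

  int main (int argc, char *argv[])
  {
      (void) argc;
      clear_nonblock (STDOUT_FILENO);
      clear_nonblock (STDERR_FILENO);
      argv[0] = "gcc";
      execvp ("gcc", argv);
      return 127;   /* only reached if exec failed */
  }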
participants (3):
- Duncan Coutts
- Jelmer Vernooij
- Manuel M T Chakravarty