anyone interested in developing a Language.C library?

Hi all,

If anyone is interested in developing a Language.C library, I've just completed a full C parser which we're using in c2hs.

It covers all of C99 and all of the GNU C extensions that I have found used in practice, including the __attribute__ annotations. It can successfully parse the whole Linux kernel and all of the C files in all the system packages on my Gentoo installation.

It's implemented as an alex lexer and a happy parser. The happy grammar has one shift/reduce conflict for the dangling if/then/else issue (which could be hidden by using precedence but it's clearer not to).

So if someone is interested in developing a more widely usable Language.C library, I think this would be a good place to start. There's plenty to do however:

* The c2hs C AST is ok but probably not enough for a general purpose library.
* The parser currently uses some other c2hs infrastructure which would need disentangling to pull the parser out (mostly identifiers and unique name supply management).
* It does not record everything into the parse tree, e.g. __attribute__s are parsed but ignored.
* It does no semantic analysis after parsing (though other bits of c2hs do a very little).
* In at least one place the parser is deliberately too liberal (to avoid ambiguities), which would require simple extra checks after parsing to detect.
* The lexical syntax has not been checked against the spec fully; it is probably over-liberal in some cases.
* I've not done much performance work. The lexer has not been seriously tuned and it still lexes via a String. Having said that, the performance is not at all bad: on a 3GHz box it does ~20k lines/sec.
* The parser error messages are terrible (it might be interesting to try porting from happy to frown for this purpose).

There's probably more stuff, but that's what I can think of right now.

So if anyone is interested then let me know, I can give some pointers (hopefully the useful kind, not the void * kind).
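For readers unfamiliar with the dangling if/then/else conflict mentioned above, here is a minimal C sketch of the construct that causes it. After the parser has seen `if (a) if (b) stmt` and the next token is `else`, it can either shift the `else` (attaching it to the inner `if`) or reduce the inner `if` first (attaching it to the outer one). C requires the first reading, and yacc-style generators, happy included as far as I know, shift by default, which is why the conflict is harmless. The function name `classify` and the `DEMO_MAIN` macro are made up for this illustration; GCC and Clang may warn about the unbraced form.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only: reports which branch ran. */
const char *classify(int a, int b)
{
    const char *result = "none";

    /* Deliberately written without braces to show the ambiguity. */
    if (a)
        if (b)
            result = "inner-then";
        else
            result = "else";    /* belongs to `if (b)`, not to `if (a)` */

    return result;
}

#ifdef DEMO_MAIN   /* hypothetical macro: build with -DDEMO_MAIN for a standalone program */
int main(void)
{
    /* If the else belonged to `if (a)`, classify(0, 0) would be "else". */
    assert(strcmp(classify(0, 0), "none") == 0);
    assert(strcmp(classify(1, 0), "else") == 0);
    printf("%s %s %s\n", classify(0, 0), classify(1, 0), classify(1, 1));
    return 0;
}
#endif
```

A parser that resolved the conflict the other way would still accept this input but would build a different tree, so this is a cheap case to include in any parser regression test.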
You can get the code from the c2hs darcs repo:

  darcs get --partial http://darcs.haskell.org/c2hs/

The C parser bits are under c2hs/c/

Duncan

Licensing: It's not 100% clear. At the moment it's marked as GPL, but it's derived from several sources so we need to be careful about that. Personally I'm happy to use LGPL. It derives from c2hs obviously, which is GPL, though we could enquire about re-licensing, especially since there is very little c2hs stuff used in it any more. It also derives partly from James A. Roskind's C grammar (in particular the grammar of declarations). His copyright license is fairly liberal but this needs double-checking. It also derives from the C99 spec, and I read the comments in the gcc C parser as a guide to GNU C's extensions to the C grammar (no code or comments were copied however).

Testing: I have tested it thus far by writing a little gcc wrapper script, so you can build any ordinary bit of C software using this wrapper and it'll call gcc with the same args but it'll also try to parse the input file. It reports into a log file. I've not tried the gcc C parser testsuite.

This approach is probably good for other tests too, such as checking whether parsing and pretty printing round-trip correctly: if not by comparing token streams (since parsing drops redundant brackets etc.), then by checking whether gcc produces identical .S/.o files.

Something that c2hs needs is to calculate sizes of types and structure member offsets correctly. This is also something that could be tested in this style, by comparing on thousands of example .c files with what gcc thinks.
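To make the size and offset comparison above concrete, here is a minimal sketch of a probe program that asks the compiler what it thinks. Compiled with gcc, its output could serve as reference values to compare against whatever a tool like c2hs computes for the same declarations. The struct `sample`, the function `dump_layout` and the `DEMO_MAIN` macro are all made up for this illustration and are not part of c2hs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sample struct for illustration; not taken from c2hs. */
struct sample {
    char   tag;
    int    count;
    double value;
};

/* Print the layout the compiler chose, one fact per line.
   Returns the number of lines successfully written. */
int dump_layout(FILE *out)
{
    int lines = 0;

    lines += fprintf(out, "sizeof(struct sample) = %zu\n",
                     sizeof(struct sample)) > 0;
    lines += fprintf(out, "offsetof(tag) = %zu\n",
                     offsetof(struct sample, tag)) > 0;
    lines += fprintf(out, "offsetof(count) = %zu\n",
                     offsetof(struct sample, count)) > 0;
    lines += fprintf(out, "offsetof(value) = %zu\n",
                     offsetof(struct sample, value)) > 0;
    return lines;
}

#ifdef DEMO_MAIN   /* hypothetical macro: build with -DDEMO_MAIN for a standalone program */
int main(void)
{
    /* The exact numbers are ABI-dependent, so nothing is hard-coded here. */
    return dump_layout(stdout) == 4 ? 0 : 1;
}
#endif
```

The printed numbers depend on the target ABI (padding and alignment differ between platforms), so the comparison only makes sense against values computed for the same target.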

Duncan Coutts wrote:
> If anyone is interested in developing a Language.C library, I've just completed a full C parser which we're using in c2hs.
> It covers all of C99 and all of the GNU C extensions that I have found used in practice, including the __attribute__ annotations. It can successfully parse the whole Linux kernel and all of the C files in all the system packages on my Gentoo installation.

Great work! Using this as a basis for a Language.C would be a really worthwhile project.

> Licensing: It's not 100% clear. At the moment it's marked as GPL, but it's derived from several sources so we need to be careful about that. Personally I'm happy to use LGPL. It derives from c2hs obviously, which is GPL, though

As far as I am concerned, LGPL is fine.

Manuel

chak:
> Duncan Coutts wrote:
> > If anyone is interested in developing a Language.C library, I've just completed a full C parser which we're using in c2hs.
> > It covers all of C99 and all of the GNU C extensions that I have found used in practice, including the __attribute__ annotations. It can successfully parse the whole Linux kernel and all of the C files in all the system packages on my Gentoo installation.
>
> Great work!
> Using this as a basis for a Language.C would be a really worthwhile project.

I think people should be very interested in this. The ability to easily manipulate and generate C would quickly insert Haskell into another useful niche. There must *surely* be real money in writing nice Haskell programs that optimise/analyse/refactor/generate C code...

-- Don

On 4/21/07, Donald Bruce Stewart wrote:
> chak:
> > Duncan Coutts wrote:
> > > If anyone is interested in developing a Language.C library, I've just completed a full C parser which we're using in c2hs.
> > > It covers all of C99 and all of the GNU C extensions that I have found used in practice, including the __attribute__ annotations. It can successfully parse the whole Linux kernel and all of the C files in all the system packages on my Gentoo installation.
> >
> > Great work!
> > Using this as a basis for a Language.C would be a really worthwhile project.
>
> I think people should be very interested in this.
> The ability to easily manipulate and generate C would quickly insert Haskell into another useful niche. There must *surely* be real money in writing nice Haskell programs that optimise/analyse/refactor/generate C code...

Unfortunately the niche is not empty. There is an OCaml library called CIL which is supposed to be pretty sweet for manipulating C code. But I still think a Haskell library would be a very good idea, and perhaps one can look at CIL for inspiration.

CIL can be found here: http://hal.cs.berkeley.edu/cil/

Cheers,
Josef

On Sat, 2007-04-21 at 12:04 +0200, Josef Svenningsson wrote:
> Unfortunately the niche is not empty. There is an OCaml library called CIL which is supposed to be pretty sweet for manipulating C code. But I still think a Haskell library would be a very good idea, and perhaps one can look at CIL for inspiration.
> CIL can be found here: http://hal.cs.berkeley.edu/cil/

Yeah, I came across this recently. It's pretty decent looking. I briefly looked at their C parser (also implemented as a lex/yacc style lexer & parser). Theirs also covers Sun and MS C language extensions, that is, Sun CC's pragmas and MS's numerous extensions.

Sadly this didn't pop up when I was googling for yacc style LALR(1) C grammars, or I might have saved myself some time by porting their grammar to alex+happy.

Duncan
participants (4)
- dons@cse.unsw.edu.au
- Duncan Coutts
- Josef Svenningsson
- Manuel M T Chakravarty