More Language.C work for Google's Summer of Code

Hello,
I'm wondering whether there's anyone on the list with an interest in doing additional work on the Language.C library for the Summer of Code. There are a few enhancements that I'd be very interested in seeing, and I'd love to be a mentor for such a project if there's a student interested in working on them.
The first is to integrate preprocessing into the library. Currently, the library calls out to GCC to preprocess source files before parsing them. This has some unfortunate consequences, however, because comments and macro information are lost. A number of program analyses could benefit from metadata encoded in comments, because C doesn't have any sort of formal annotation mechanism, but in the current state we have to resort to ugly hacks (at best) to get at the contents of comments. Also, effective diagnostic messages need to be closely tied to original source code. In the presence of pre-processed macros, column number information is unreliable, so it can be difficult to describe to a user exactly what portion of a program a particular analysis refers to. An integrated preprocessor could retain comments and remember information about macros, eliminating both of these problems.
The second possible project is to create a nicer interface for traversals over Language.C ASTs. Currently, the symbol table is built to include only information about global declarations and those other declarations currently in scope. Therefore, when performing multiple traversals over an AST, each traversal must re-analyze all global declarations and the entire AST of the function of interest. A better solution might be to build a traversal that creates a single symbol table describing all declarations in a translation unit (including function- and block-scoped variables), for easy reference during further traversals. It may also be valuable to have this traversal produce a slightly-simplified AST in the process. I'm not thinking of anything as radical as the simplifications performed by something like CIL, however. It might simply be enough to transform variable references into a form suitable for easy lookup in a complete symbol table like I've just described. Other simple transformations such as making all implicit casts explicit, or normalizing compound initializers, could also be good.
A third possibility, which would probably depend on the integrated preprocessor, would be to create an exact pretty-printer. That is, a pretty-printing function such that pretty . parse is the identity. Currently, parse . pretty should be the identity, but it's not true the other way around. An exact pretty-printer would be very useful in creating rich presentations of C source code --- think LXR on steroids.
If you're interested in any combination of these, or anything similar, let me know. The deadline is approaching quickly, but I'd be happy to work together with a student to flesh any of these out into a full proposal.
Thanks, Aaron
-- Aaron Tomb Galois, Inc. (http://www.galois.com) atomb@galois.com Phone: (503) 808-7206 Fax: (503) 350-0833
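To make the "pretty . parse is the identity" requirement concrete, here is a minimal sketch in C of the underlying idea: if the front end keeps comments as segments of their own instead of discarding them, re-emitting the segments reproduces the input byte for byte. The names `next_segment` and `roundtrip` are hypothetical and are not part of Language.C. The sketch only recognises block comments and does not handle comment openers inside string literals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the length of the segment that starts at `p`
 * and set *is_comment. A segment is either one block comment or a run of
 * other text up to the next comment opener. */
size_t next_segment(const char *p, int *is_comment)
{
    assert(p && is_comment && *p);
    if (p[0] == '/' && p[1] == '*') {
        const char *end = strstr(p + 2, "*/");
        *is_comment = 1;
        /* An unterminated comment runs to the end of the input. */
        return end ? (size_t)(end - p) + 2 : strlen(p);
    }
    const char *next = strstr(p + 1, "/*");
    *is_comment = 0;
    return next ? (size_t)(next - p) : strlen(p);
}

/* Rebuild the input from its segments and return the number of comment
 * segments seen. Because no segment is dropped, `rebuilt` ends up equal
 * to `input`. */
int roundtrip(const char *input, char *rebuilt, size_t size)
{
    int comments = 0;
    size_t used = 0;
    assert(input && rebuilt && strlen(input) < size);
    while (*input) {
        int is_comment;
        size_t n = next_segment(input, &is_comment);
        memcpy(rebuilt + used, input, n);
        used += n;
        input += n;
        comments += is_comment;
    }
    rebuilt[used] = '\0';
    return comments;
}
```

A real exact printer would have to carry the same information through the AST rather than through a flat segment list, which is where most of the work lies.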

I tried to devise a C preprocessor, but then I figured out that I
could write something like this:
---------------------------
#define A(arg) A_start (arg) A_end
#define A_start "this is A_start definition."
#define A_end "this is A_end definition."
A (
#undef A_start
#define A_start A_end
)
---------------------------
gcc preprocesses it into the following:
---------------------------
# 1 "a.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "a.c"
"this is A_end definition." () "this is A_end definition."
---------------------------
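The `# 1 "a.c"` lines in the output above are GCC linemarkers, which have the form `# linenum "filename" flags`. A tool that reads preprocessed output can use them to map a position back to a file and line, although not to a column. Below is a minimal sketch in C; `parse_linemarker` is a hypothetical helper name, and the sketch assumes filenames contain no escaped quotes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse a GCC-style linemarker such as
 *     # 1 "a.c"
 * Returns 1 and fills *line and file on success, 0 otherwise.
 * Assumes the filename has no escaped quotes and fits in 255 chars;
 * trailing flags are ignored. */
int parse_linemarker(const char *text, unsigned *line, char file[256])
{
    assert(text && line && file);
    /* %255[^"] reads everything up to the closing quote. */
    if (sscanf(text, "# %u \"%255[^\"]\"", line, file) == 2)
        return 1;
    return 0;
}
```

Lines that are not linemarkers, including the expanded source line and ordinary directives such as `#define`, are rejected because no line number follows the `#`.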
Another woe is filenames in angle brackets for #include. They
require a special case in the tokenizer.
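The special case is that, directly after `#include`, text such as `<stdio.h>` has to be read as a single header-name token rather than as `<`, `stdio`, `.`, `h`, `>`. A minimal sketch of that context-sensitive step in C follows; `scan_header_name` is a hypothetical helper, and it handles only the two literal forms, not an operand produced by macro expansion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. `p` points just past the `include` keyword.
 * On success, copies the header name (without its delimiters) into `out`,
 * stores the opening delimiter ('<' or '"') in *delim, and returns 1.
 * Returns 0 if no literal header name is present or it does not fit. */
int scan_header_name(const char *p, char *out, size_t outsz, char *delim)
{
    assert(p && out && delim && outsz > 0);
    while (*p == ' ' || *p == '\t')
        p++;
    if (*p != '<' && *p != '"')
        return 0;                       /* e.g. a macro name: not handled */
    char close = (*p == '<') ? '>' : '"';
    const char *start = p + 1;
    const char *end = strchr(start, close);
    if (end == NULL || end == start)
        return 0;                       /* unterminated or empty name */
    size_t len = (size_t)(end - start);
    if (len >= outsz)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    *delim = *p;
    return 1;
}
```

The tokenizer would call this only when the previous tokens were `#` and `include`; everywhere else `<` remains an ordinary operator.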
So I gave up on it (a fully compliant C preprocessor). ;)
Other than that, the C preprocessor looks simple.
I hardly qualify as a student, though.
2010/3/30 Aaron Tomb
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 30 March 2010 18:55, Serguey Zefirov
Other than that, C preprocessor looks simple.
Ah no - apparently anything but simple. You might want to see Jean-Marie Favre's (very readable, amusing) papers on the subject. Much of the behaviour of CPP is not defined and is often inaccurately described; it certainly wouldn't appear to make an ideal one-summer student project.
http://megaplanet.org/jean-marie-favre/papers/CPPDenotationalSemantics.pdf
There are some others as well on his home page.
Best wishes, Stephen

Stephen Tetley
Much of the behaviour of CPP is not defined and often inaccurately described, certainly it wouldn't appear to make an ideal one summer, student project.
If you get http://ldeniau.web.cern.ch/ldeniau/cos.html to work, virtually everything else should work, too. Macro languages haven't been in fashion in recent decades, so you'd have to locate a veritable fan to work on this. There are, after all, still people writing TeX macros. There have got to be some CPP zealots out there.
-- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

On 19:54 Tue 30 Mar , Stephen Tetley wrote:
On 30 March 2010 18:55, Serguey Zefirov
wrote: Other than that, C preprocessor looks simple.
Ah no - apparently anything but simple.
I would describe it as "simple but somewhat annoying". This means that guessing at its specification will not result in anything resembling a correct implementation, but reading the specification and implementing accordingly is straightforward. Probably the hardest part is expression evaluation.
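As a rough illustration of what `#if` expression evaluation involves, here is a minimal recursive-descent sketch in C. It assumes macro expansion has already happened and supports only decimal integers, parentheses, unary `!` and `-`, `*`, `/`, `+`, `-`, the relational and equality operators, `&&` and `||`. Identifiers left over after expansion evaluate to 0, as the C standard specifies. Everything else (`defined`, bitwise operators, shifts, `?:`, character constants, unsigned and `intmax_t` arithmetic, error reporting) is left out, and all names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

static const char *cur;            /* cursor into the expression text */
static long parse_or(void);

/* Skip whitespace, then consume `op` if it comes next. */
static int eat(const char *op)
{
    size_t n = strlen(op);
    while (isspace((unsigned char)*cur)) cur++;
    if (strncmp(cur, op, n) != 0) return 0;
    cur += n;
    return 1;
}

static long parse_primary(void)
{
    if (eat("(")) { long v = parse_or(); eat(")"); return v; }
    if (isdigit((unsigned char)*cur)) {
        char *end;
        long v = strtol(cur, &end, 10);
        cur = end;
        return v;
    }
    /* Identifiers that survive macro expansion evaluate to 0.
     * Malformed input also yields 0; this sketch does not diagnose it. */
    while (isalnum((unsigned char)*cur) || *cur == '_') cur++;
    return 0;
}

static long parse_unary(void)
{
    if (eat("!")) return !parse_unary();
    if (eat("-")) return -parse_unary();
    return parse_primary();
}

static long parse_mul(void)
{
    long v = parse_unary();
    for (;;) {
        if (eat("*")) v *= parse_unary();
        /* Division by zero yields 0 here instead of a diagnostic. */
        else if (eat("/")) { long r = parse_unary(); v = r ? v / r : 0; }
        else return v;
    }
}

static long parse_add(void)
{
    long v = parse_mul();
    for (;;) {
        if (eat("+")) v += parse_mul();
        else if (eat("-")) v -= parse_mul();
        else return v;
    }
}

static long parse_rel(void)
{
    long v = parse_add();
    for (;;) {                     /* two-character operators first */
        if (eat("<=")) v = v <= parse_add();
        else if (eat(">=")) v = v >= parse_add();
        else if (eat("<")) v = v < parse_add();
        else if (eat(">")) v = v > parse_add();
        else return v;
    }
}

static long parse_eq(void)
{
    long v = parse_rel();
    for (;;) {
        if (eat("==")) v = v == parse_rel();
        else if (eat("!=")) v = v != parse_rel();
        else return v;
    }
}

static long parse_and(void)
{
    long v = parse_eq();
    while (eat("&&")) { long r = parse_eq(); v = v && r; }
    return v;
}

static long parse_or(void)
{
    long v = parse_and();
    while (eat("||")) { long r = parse_and(); v = v || r; }
    return v;
}

/* Entry point: evaluate an already macro-expanded #if expression. */
long eval_if_expr(const char *text)
{
    assert(text != NULL);
    cur = text;
    return parse_or();
}
```

The grammar itself is small; most of the remaining effort in a conforming implementation goes into the omitted cases and into handling `defined` before expansion.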
You might want to see Jean-Marie Favre's (very readable, amusing) papers on subject. Much of the behaviour of CPP is not defined and often inaccurately described, certainly it wouldn't appear to make an ideal one summer, student project.
The only specification of the C preprocessor that matters is the one contained in the specification of the C programming language. The accuracy of any other description of it is not relevant. C is quite possibly the language with the greatest quantity of inaccurate descriptions in existence (scratch that, C++ is likely worse).
As with most of the C programming language, a lot of the behaviour is implementation-defined or even undefined, as you suggest. For example:
  /* implementation-defined */
  #pragma launch_missiles
  /* undefined */
  #define explosion defined
  #if explosion
  # pragma launch_missiles
  #endif
This makes a preprocessor /easier/ to implement, because in these cases the implementer can do /whatever she wants/, including doing nothing or starting the missile launch procedure. In the implementation-defined case, the implementor must additionally write the decision down somewhere, i.e. "Upon execution of a #pragma launch_missiles directive, all missiles are launched".
http://megaplanet.org/jean-marie-favre/papers/CPPDenotationalSemantics.pdf
If this paper had criticised the actual C standard as opposed to a working draft, it would have been easier to take it seriously. I find the published standard quite clear about the requirements of a C preprocessor. Nevertheless, assuming that the complaints of the paper remain valid, it appears to boil down to "The C preprocessor is weird, and one must read its whole specification to understand all of it". It also seems to contain a bit of "The C standard does not precisely describe the GNU C preprocessor". This work is certainly within the scope of a summer project.
-- Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

Stephen Tetley wrote:
Much of the behaviour of CPP is not defined and often inaccurately described, certainly it wouldn't appear to make an ideal one summer, student project.
But to give Language.C integrated support for preprocessing, one needn't implement CPP. One only needs to implement the right API for a preprocessor to communicate with the parser/analyzer. Considering all the folks outside of C who use the CPP *cough*Haskell*cough*, having a stand-alone CPP would be good in its own right. In fact, I seem to recall there's already one of those floating around somewhere... ;)
I think it'd be far cooler and more useful to give Language.C integrated preprocessor support without hard-wiring it to the CPP, especially given that there are divergent semantics for different CPP implementations, and given that we could easily imagine wanting to use another preprocessor (e.g., for annotations, documentation, etc.).
-- Live well, ~wren

(sorry for the dupe aaron! forgot to add haskell-cafe to senders list!)
Perhaps the best course of action would be to try and extend cpphs to
do things like this? From the looks of the interface, it can already
do some of these things e.g. do not strip comments from a file:
http://hackage.haskell.org/packages/archive/cpphs/1.11/doc/html/Language-Pre...
Malcolm would have to attest to how complete it is w.r.t. say, gcc's
preprocessor, but if this were to be a SOC project, extending cpphs to
include needed functionality would probably be much more realistic
than writing a new one.
-- - Austin

Yes, that would definitely be one productive way forward. One concern is that Language.C is BSD-licensed (and it would be nice to keep it that way), and cpphs is LGPL. However, if cpphs remained a separate program, producing C + extra stuff as output, and the Language.C parser understood the extra stuff, this could accomplish what I'm interested in. It would be interesting, even, to just extend the Language.C parser to support comments, and to tell cpphs to leave them in.
There's also another pre-processor, mcpp [1], that is quite featureful and robust, and which supports an output mode with special syntax describing the origin of the code resulting from macro expansion.
Aaron
[1] http://mcpp.sourceforge.net/
On Mar 30, 2010, at 12:14 PM, austin seipp wrote:
(sorry for the dupe aaron! forgot to add haskell-cafe to senders list!)
Perhaps the best course of action would be to try and extend cpphs to do things like this? From the looks of the interface, it can already do some of these things e.g. do not strip comments from a file:
http://hackage.haskell.org/packages/archive/cpphs/1.11/doc/html/Language-Pre...
Malcolm would have to attest to how complete it is w.r.t. say, gcc's preprocessor, but if this were to be a SOC project, extending cpphs to include needed functionality would probably be much more realistic than writing a new one.

Malcolm would have to attest to how complete it is w.r.t. say, gcc's preprocessor,
cpphs is intended to be as faithful to the CPP standard as possible, whilst still retaining the extra flexibility we want in a non-C environment, e.g. retaining the operator symbols //, /*, and */. If the behaviour of cpphs does not match gcc -E, then it is either a bug (please report it) or an intentional feature.
Real CPP is rather horribly defined as a lexical analyser for C, so has a builtin notion of identifier, operator, etc, which is not so useful for all the other settings in which we just want to use conditional inclusion or macros. Also, CPP fully intermingles conditionals, file inclusion, and macro expansion, whereas cpphs makes a strenuous effort to separate those things into logical phases: first the conditionals and inclusions, then macro expansion. This separation makes it possible to run only one or other of the phases, which can occasionally be useful.
One concern is that Language.C is BSD-licensed (and it would be nice to keep it that way), and cpphs is LGPL. However, if cpphs remained a separate program, producing C + extra stuff as output, and the Language.C parser understood the extra stuff, this could accomplish what I'm interested in.
As for licensing, yes, cpphs, as a standalone binary, is GPL. The library version is LGPL. One misconception is that a BSD-licensed library cannot use an LGPL'd library - of course it can. You just need to ensure that everyone can update the LGPL'd part if they wish. And as I always state for all of my tools, if the licence is a problem for any user, contact me to negotiate terms. I'm perfectly willing to allow commercial distribution with exemption from some of the GPL obligations. (And I note in passing that other alternatives like gcc are also GPL'd.)
Regards, Malcolm

I'd be very much interested in working on this library for GSoC. I'm
currently working on an idea for another project, but I'm not certain
how widely beneficial it would be. The preprocessor and
pretty-printing projects sound especially intriguing.

That's very good to hear!
When it comes to preprocessing and exact printing, I think that there are various stages of completeness that we could support.
1) Add support for parsing comments to the Language.C parser. Keep using an external pre-processor but tell it to leave comments in the source code. The cpphs pre-processor can do this. The trickiest bit here would have to do with where to record the comments in the AST. What AST node is a given comment associated with? We could probably come up with some general rules, and perhaps certain comments, in weird locations, would still be ignored.
2) Support correct column numbers for source locations. This falls short of complete macro support, but covers one of the key problems that macros introduce. The mcpp preprocessor [1] has a special diagnostic mode where it adds special comments describing the origin of code that resulted from macro expansion. If the parser retained comments, we could use this information to help with exact pretty-printing.
3) Modify the pretty-printer to take position information into account when pretty-printing (at least optionally). As long as macro definitions themselves (as well as #ifdef, etc.) are not in the AST, the output will still not be exactly the same as the input, but it'll come closer.
4) Add full support for parsing and expanding macros internally, so that both macro definitions and expansions appear in the Language.C AST. This is probably a huge project, partly because macros do not have to obey the tree structure of the C language in any way. This is perhaps beyond the scope of a summer project, but the other steps could help prepare for it in the future, and still fully address some of the problems caused by the preprocessor along the way.
Do you think you'd be interested in some subset or variation of 1, 2, and 3? Are there other ideas you have? Things I've missed? Things you'd do differently?
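For the question in step 1 of which AST node a comment belongs to, one possible general rule is to attach each comment to the first node that starts on or after the line where the comment ends, and to leave comments with nothing after them unattached. The sketch below shows that rule in C over plain line numbers. The function name and the rule itself are assumptions for illustration, not anything Language.C defines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rule: a comment belongs to the first node that starts on or
 * after the line where the comment ends. `node_start` holds the starting
 * line of each node in ascending order. Returns the index of the chosen
 * node, or -1 when no node follows the comment. */
int attach_comment(int comment_end_line, const int *node_start, size_t n)
{
    assert(node_start != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (node_start[i] >= comment_end_line)
            return (int)i;
    return -1;
}
```

With nodes starting on lines 3, 10 and 42, a comment ending on line 2 attaches to the first node and one ending on line 50 stays unattached. Comments in the middle of an expression would need a finer rule, or would be ignored as suggested above.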
Thanks, Aaron
[1] http://mcpp.sourceforge.net/
On Mar 30, 2010, at 1:46 PM, Edward Amsden wrote:
I'd be very much interested in working on this library for GSoC. I'm currently working on an idea for another project, but I'm not certain how widely beneficial it would be. The preprocessor and pretty-printing projects sound especially intriguing.

On Tue, Mar 30, 2010 at 5:14 PM, Aaron Tomb
That's very good to hear!
When it comes to preprocessing and exact printing, I think that there are various stages of completeness that we could support.
1) Add support for parsing comments to the Language.C parser. Keep using an external pre-processor but tell it to leave comments in the source code. The cpphs pre-processor can do this. The trickiest bit here would have to do with where to record the comments in the AST. What AST node is a given comment associated with? We could probably come up with some general rules, and perhaps certain comments, in weird locations, would still be ignored.
2) Support correct column numbers for source locations. This falls short of complete macro support, but covers one of the key problems that macros introduce. The mcpp preprocessor [1] has a special diagnostic mode where it adds special comments describing the origin of code that resulted from macro expansion. If the parser retained comments, we could use this information to help with exact pretty-printing.
3) Modify the pretty-printer to take position information into account when pretty-printing (at least optionally). As long as macro definitions themselves (as well as #ifdef, etc.) are not in the AST, the output will still not be exactly the same as the input, but it'll come closer.
4) Add full support for parsing and expanding macros internally, so that both macro definitions and expansions appear in the Language.C AST. This is probably a huge project, partly because macros do not have to obey the tree structure of the C language in any way. This is perhaps beyond the scope of a summer project, but the other steps could help prepare for it in the future, and still fully address some of the problems caused by the preprocessor along the way.
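As a rough illustration of the comment-placement question in step 1, here is a minimal sketch of one possible rule: attach each comment to the first node that starts at or after it, and report comments with no following node separately. The Pos, Comment, and Node types and the attach function are hypothetical stand-ins for illustration only; they are not part of the Language.C API.

```haskell
import Data.List (sortOn)

-- Hypothetical, simplified source position: (line, column).
type Pos = (Int, Int)

-- Hypothetical comment record; not a Language.C type.
data Comment = Comment { commentPos :: Pos, commentText :: String }
  deriving (Eq, Show)

-- Hypothetical stand-in for an AST node carrying its start position.
data Node = Node { nodePos :: Pos, nodeName :: String }
  deriving (Eq, Show)

-- Attach each comment to the first node that starts at or after the
-- comment. Comments with no following node are returned separately;
-- these correspond to comments in "weird locations".
attach :: [Node] -> [Comment] -> ([(Node, [Comment])], [Comment])
attach nodes comments = (map withComments sorted, orphans)
  where
    sorted = sortOn nodePos nodes
    owner c = case dropWhile (\n -> nodePos n < commentPos c) sorted of
                (n : _) -> Just n
                []      -> Nothing
    withComments n = (n, [c | c <- comments, owner c == Just n])
    orphans = [c | c <- comments, owner c == Nothing]
```

Other rules are equally plausible (for example, attaching a trailing comment to the preceding node on the same line); the point is only that the rule can be expressed over source positions once the parser retains comments.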
I haven't looked at the C spec on macros, but I'm pretty motivated and would like to shoot for a big project.
Do you think you'd be interested in some subset or variation of 1, 2, and 3? Are there other ideas you have? Things I've missed? Things you'd do differently?
I'm very interested in all 3 of them, and actually somewhat in #4, though I'll have to do some reading to understand why you're saying it's such a big undertaking.
Thanks, Aaron
[1] http://mcpp.sourceforge.net/
On Mar 30, 2010, at 1:46 PM, Edward Amsden wrote:
I'd be very much interested in working on this library for GSoC. I'm currently working on an idea for another project, but I'm not certain how widely beneficial it would be. The preprocessor and pretty-printing projects sound especially intriguing.
On Tue, Mar 30, 2010 at 1:30 PM, Aaron Tomb wrote:
Hello,
I'm wondering whether there's anyone on the list with an interest in doing additional work on the Language.C library for the Summer of Code. There are a few enhancements that I'd be very interested in seeing, and I'd love to be a mentor for such a project if there's a student interested in working on them.
The first is to integrate preprocessing into the library. Currently, the library calls out to GCC to preprocess source files before parsing them. This has some unfortunate consequences, however, because comments and macro information are lost. A number of program analyses could benefit from metadata encoded in comments, because C doesn't have any sort of formal annotation mechanism, but in the current state we have to resort to ugly hacks (at best) to get at the contents of comments. Also, effective diagnostic messages need to be closely tied to original source code. In the presence of pre-processed macros, column number information is unreliable, so it can be difficult to describe to a user exactly what portion of a program a particular analysis refers to. An integrated preprocessor could retain comments and remember information about macros, eliminating both of these problems.
The second possible project is to create a nicer interface for traversals over Language.C ASTs. Currently, the symbol table is built to include only information about global declarations and those other declarations currently in scope. Therefore, when performing multiple traversals over an AST, each traversal must re-analyze all global declarations and the entire AST of the function of interest. A better solution might be to build a traversal that creates a single symbol table describing all declarations in a translation unit (including function- and block-scoped variables), for easy reference during further traversals. It may also be valuable to have this traversal produce a slightly-simplified AST in the process. I'm not thinking of anything as radical as the simplifications performed by something like CIL, however. It might simply be enough to transform variable references into a form suitable for easy lookup in a complete symbol table like I've just described. Other simple transformations such as making all implicit casts explicit, or normalizing compound initializers, could also be good.
A third possibility, which would probably depend on the integrated preprocessor, would be to create an exact pretty-printer. That is, a pretty-printing function such that pretty . parse is the identity. Currently, parse . pretty should be the identity, but it's not true the other way around. An exact pretty-printer would be very useful in creating rich presentations of C source code --- think LXR on steroids.
If you're interested in any combination of these, or anything similar, let me know. The deadline is approaching quickly, but I'd be happy to work together with a student to flesh any of these out into a full proposal.
Thanks, Aaron
-- Aaron Tomb Galois, Inc. (http://www.galois.com) atomb@galois.com Phone: (503) 808-7206 Fax: (503) 350-0833

On Tue, Mar 30, 2010 at 7:30 PM, Aaron Tomb wrote:
Hello,
I'm wondering whether there's anyone on the list with an interest in doing additional work on the Language.C library for the Summer of Code. There are a few enhancements that I'd be very interested in seeing, and I'd love to be a mentor for such a project if there's a student interested in working on them.
Here's another suggestion: A transformer to convert Language.C's AST to RTL, thus hiding a lot of tedious details like structures, case statements, variable declarations, typedefs, etc.
I started writing a model checker [1] based on Language.C, but got so bogged down in all the details of C that I lost interest.
-Tom
[1] http://hackage.haskell.org/package/afv
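To make the AST-to-RTL suggestion concrete, here is a minimal sketch of the general idea: flattening nested expressions into a sequence of simple register-transfer instructions. The Expr, Operand, and Instr types and the lower function are hypothetical and cover only a toy expression language; they are not Language.C's AST, and a real transformer would also have to deal with structures, case statements, declarations, and typedefs.

```haskell
-- Toy expression tree; a hypothetical stand-in for a C expression AST.
data Expr
  = Lit Int
  | Var String
  | Add Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

-- Operands of the flattened form: immediates, named variables,
-- and numbered temporary registers.
data Operand = Imm Int | Named String | Reg Int
  deriving (Eq, Show)

-- Each instruction writes its result to the temporary given first.
data Instr
  = IAdd Int Operand Operand
  | IMul Int Operand Operand
  deriving (Eq, Show)

-- Lower an expression to a list of instructions plus the operand
-- that holds the final value. Temporaries are numbered from 0.
lower :: Expr -> ([Instr], Operand)
lower e = let (is, op, _) = go e 0 in (is, op)

-- Thread the next free temporary number through the traversal.
go :: Expr -> Int -> ([Instr], Operand, Int)
go (Lit n)   k = ([], Imm n, k)
go (Var v)   k = ([], Named v, k)
go (Add a b) k = bin IAdd a b k
go (Mul a b) k = bin IMul a b k

-- Lower both operands left to right, then emit one instruction.
bin :: (Int -> Operand -> Operand -> Instr)
    -> Expr -> Expr -> Int -> ([Instr], Operand, Int)
bin mk a b k =
  let (ia, oa, k1) = go a k
      (ib, ob, k2) = go b k1
  in (ia ++ ib ++ [mk k2 oa ob], Reg k2, k2 + 1)
```

For example, lowering x + 2 * y with this sketch yields a multiply into temporary 0 followed by an add into temporary 1, so later analyses only ever see flat instructions over operands.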

On Mar 30, 2010, at 3:16 PM, Tom Hawkins wrote:
On Tue, Mar 30, 2010 at 7:30 PM, Aaron Tomb wrote:
Hello,
I'm wondering whether there's anyone on the list with an interest in doing additional work on the Language.C library for the Summer of Code. There are a few enhancements that I'd be very interested in seeing, and I'd love to be a mentor for such a project if there's a student interested in working on them.
Here's another suggestion: A transformer to convert Language.C's AST to RTL, thus hiding a lot of tedious details like structures, case statements, variable declarations, typedefs, etc.
I started writing a model checker [1] based on Language.C, but got so bogged down in all the details of C I lost interest.
I would also love to have something along these lines, and would be happy to mentor such a project. On a related note, I have some code sitting around that converts Language.C ASTs into a variant of Guarded Commands, and I expect I'll release that at some point. For the moment, it's a little too intimately tied to the program it's part of, though.
Aaron
participants (10)
- Aaron Tomb
- Achim Schneider
- austin seipp
- Edward Amsden
- Malcolm Wallace
- Nick Bowler
- Serguey Zefirov
- Stephen Tetley
- Tom Hawkins
- wren ng thornton