More Language.C work for Google's Summer of Code

30 Mar 2010

Hello,

I'm wondering whether there's anyone on the list with an interest in
doing additional work on the Language.C library for the Summer of
Code. There are a few enhancements that I'd be very interested in seeing,
and I'd love to be a mentor for such a project if there's a student
interested in working on them.

The first is to integrate preprocessing into the library. Currently,  
the library calls out to GCC to preprocess source files before parsing  
them. This has some unfortunate consequences, however, because  
comments and macro information are lost. A number of program analyses  
could benefit from metadata encoded in comments, because C doesn't  
have any sort of formal annotation mechanism, but in the current state  
we have to resort to ugly hacks (at best) to get at the contents of  
comments. Also, effective diagnostic messages need to be closely tied
to the original source code. Once macros have been expanded,
column number information is unreliable, so it can be difficult to
describe to a user exactly what portion of a program a particular  
analysis refers to. An integrated preprocessor could retain comments  
and remember information about macros, eliminating both of these  
problems.
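
To make the comment-retention idea concrete, here is a minimal sketch in
plain Haskell. It is a toy scanner, not part of Language.C, and every
name in it (Pos, Comment, comments) is hypothetical. It only shows that
block comments can be collected along with the line and column where
they started in the original, unpreprocessed source.

```haskell
-- Toy sketch: collect /* ... */ comments with their original positions.
-- Not Language.C's API; all names here are hypothetical.

data Pos = Pos { posLine :: Int, posCol :: Int }
  deriving (Eq, Show)

data Comment = Comment { commentPos :: Pos, commentText :: String }
  deriving (Eq, Show)

-- Advance a position over one character (lines and columns start at 1).
advance :: Pos -> Char -> Pos
advance (Pos l _) '\n' = Pos (l + 1) 1
advance (Pos l c) _    = Pos l (c + 1)

-- Scan the source, returning every block comment and where it started.
-- Simplification: string and character literals are not handled, so a
-- "/*" inside a literal would be misread as a comment.
comments :: String -> [Comment]
comments = go (Pos 1 1)
  where
    go _ [] = []
    go p ('/':'*':rest) =
      let (body, p', rest') = takeBody (advance (advance p '/') '*') rest
      in Comment p body : go p' rest'
    go p (c:rest) = go (advance p c) rest

    -- Consume the comment body up to and including the closing "*/".
    -- An unterminated comment simply runs to the end of the input.
    takeBody p ('*':'/':rest) = ("", advance (advance p '*') '/', rest)
    takeBody p (c:rest) =
      let (body, p', rest') = takeBody (advance p c) rest
      in (c : body, p', rest')
    takeBody p [] = ("", p, [])
```

An analysis could then look up annotations by position, for example by
matching a comment's line against the line of the declaration it follows.
A real integrated preprocessor would also have to record macro
definitions and expansion sites, which this sketch does not attempt.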

The second possible project is to create a nicer interface for  
traversals over Language.C ASTs. Currently, the symbol table is built  
to include only information about global declarations and those other  
declarations currently in scope. Therefore, when performing multiple  
traversals over an AST, each traversal must re-analyze all global  
declarations and the entire AST of the function of interest. A better  
solution might be to build a traversal that creates a single symbol  
table describing all declarations in a translation unit (including  
function- and block-scoped variables), for easy reference during  
further traversals. It may also be valuable to have this traversal  
produce a slightly simplified AST in the process. I'm not thinking of
anything as radical as the simplifications performed by something like  
CIL, however. It might simply be enough to transform variable  
references into a form suitable for easy lookup in a complete symbol  
table like the one I've just described. Other simple transformations such as
making all implicit casts explicit, or normalizing compound  
initializers, could also be good.
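
As a rough illustration of the complete-symbol-table idea, the sketch
below uses a toy statement type, far simpler than Language.C's AST. All
of the names (Stmt, Key, Info, SymTab, resolve) are hypothetical. A
single traversal gives every declaration, including block-scoped ones, a
unique key, and rewrites each variable reference into that key so later
traversals can look it up directly.

```haskell
-- Toy sketch: one pass that builds a symbol table for *all* scopes and
-- rewrites variable references into keys. Not Language.C's API; the
-- types and names are hypothetical.

type Ident = String
type Key   = Int          -- unique per declaration

data Stmt ref
  = Decl Ident            -- declare a variable in the current scope
  | Use ref               -- reference a variable
  | Block [Stmt ref]      -- nested scope
  deriving (Eq, Show)

data Info = Info { infoName :: Ident, infoDepth :: Int }
  deriving (Eq, Show)

type SymTab = [(Key, Info)]    -- one entry per declaration in the unit
type Env    = [(Ident, Key)]   -- names currently in scope, innermost first

-- Resolve every reference; undeclared names become Nothing.
resolve :: [Stmt Ident] -> (SymTab, [Stmt (Maybe Key)])
resolve stmts =
  let (tab, _, out) = goList 0 [] [] stmts
  in (reverse tab, out)
  where
    goList :: Int -> Env -> SymTab -> [Stmt Ident]
           -> (SymTab, Env, [Stmt (Maybe Key)])
    goList _ env tab [] = (tab, env, [])
    goList d env tab (s:ss) =
      let (tab1, env1, s')  = go d env tab s
          (tab2, env2, ss') = goList d env1 tab1 ss
      in (tab2, env2, s' : ss')

    go :: Int -> Env -> SymTab -> Stmt Ident
       -> (SymTab, Env, Stmt (Maybe Key))
    go d env tab (Decl x) =
      let k = length tab                 -- next unused key
      in ((k, Info x d) : tab, (x, k) : env, Decl x)
    go _ env tab (Use x) = (tab, env, Use (lookup x env))
    go d env tab (Block body) =
      -- Declarations made inside the block stay in the table, but the
      -- scope environment is restored when the block ends.
      let (tab', _, body') = goList (d + 1) env tab body
      in (tab', env, Block body')
```

Because the table outlives the scopes that produced it, a second
traversal can work from the keys alone and never needs to re-analyze
declarations.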

A third possibility, which would probably depend on the integrated  
preprocessor, would be to create an exact pretty-printer. That is, a  
pretty-printing function such that pretty . parse is the identity.  
Currently, parse . pretty should be the identity, but it's not true  
the other way around. An exact pretty-printer would be very useful in  
creating rich presentations of C source code --- think LXR on steroids.
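
The round-trip property can be illustrated with a toy lossless
tokenizer. This is only a sketch of one common technique (attaching the
preceding whitespace to each token) and says nothing about how
Language.C would implement it. The names Token, Doc, parse and pretty
are hypothetical stand-ins.

```haskell
import Data.Char (isSpace)

-- Toy sketch of an exact printer: each token remembers the whitespace
-- that preceded it, so printing can reproduce the input byte for byte.
-- Not Language.C's API; all names here are hypothetical.

data Token = Token { leading :: String, lexeme :: String }
  deriving (Eq, Show)

-- Parsed form: the tokens plus any whitespace left at the end of input.
data Doc = Doc [Token] String
  deriving (Eq, Show)

-- "Parse" by splitting on whitespace while keeping the whitespace itself.
parse :: String -> Doc
parse src =
  case span isSpace src of
    (ws, [])   -> Doc [] ws
    (ws, rest) ->
      let (word, rest') = break isSpace rest
          Doc toks end  = parse rest'
      in Doc (Token ws word : toks) end

-- Print every token preceded by exactly the whitespace it came with.
pretty :: Doc -> String
pretty (Doc toks end) =
  concatMap (\t -> leading t ++ lexeme t) toks ++ end
```

With these definitions, pretty (parse s) == s holds for any input string
s, because no character of the input is discarded. A real exact
pretty-printer would need to keep comments and macro invocations in the
same way, which is why it would likely depend on the integrated
preprocessor.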

If you're interested in any combination of these, or anything similar,  
let me know. The deadline is approaching quickly, but I'd be happy to  
work together with a student to flesh any of these out into a full  
proposal.

Thanks,
Aaron

-- 
Aaron Tomb
Galois, Inc. (http://www.galois.com)
atomb@galois.com
Phone: (503) 808-7206
Fax: (503) 350-0833
