
Hi,

(Please note this is coming from my own experience working with the LHC Haskell compiler, as well as a compiler I'm currently working on in SML. I'm not an authority, but as another greenhorn compiler hacker I thought I might give some advice.)

Excerpts from Loup Vaillant's message of Sat Mar 07 17:32:48 -0600 2009:
> Ideally, I would very much like to compile to C.
>
> The requirements are easily stated. My language must
>  -> be lazy by default,
>  -> support algebraic data types and case expressions (unless I can get away with encoding them as functions),
>  -> have some kind of FFI with C (I suppose it implies some support for unboxed values).
>
> There is also the first class functions thing, but lambda lifting is okay.
Unfortunately, after working on LHC, I can safely say (like all the researchers who wrote the papers which I *didn't* read beforehand) that for a lot of purposes, C is an unsuitable target language for a compiler. It works pretty well if you can make the source language's execution model fit closely enough, but if you can't, there's a price to pay, in sanity, performance, or both.

(On that note, I am currently of the opinion that most of LHC's major deficiencies, aside from a few parser bugs and some needed optimizations, come from the fact that compiling to C is currently our only option; because of it, we have no exception handling and no proper garbage collection at all, and the runtime system is a little needlessly 'clever' (if small and understandable) so it can cope with that.)

However, since your goal is *not* efficiency, you will be happy to know that the issues with compiling to C (like garbage collection and exception handling) can be worked around viably. For garbage collection, please see:

  "Accurate Garbage Collection in an Uncooperative Environment" - http://citeseer.ist.psu.edu/538613.html

This strategy is currently used in Mercury as well as Ben L.'s DDC language; on that note, I think if you spend some time looking through DDC's runtime and generated code, you can see exactly what the paper is talking about, because it's actually a very simple strategy for holding onto GC roots (I've sketched the idea in C below):

  http://code.haskell.org/ddc/ddc-head/runtime/

It's for reasons like this (C is really hard to compile to effectively) that Simon & co. have spent so much time on the C-- project. C-- is one of the backend languages used in GHC, and it is designed to be a 'uniform assembly language' for high-level languages to compile to. You can find a lot of info on C-- here; I recommend the paper at the bottom to start off:

  http://research.microsoft.com/en-us/um/people/simonpj/papers/c--/

These papers should further illustrate the issues with compiling to C; for a pedagogical excursion they can, for the most part, be worked around simply, as Henderson's paper illustrates, but they're worth keeping in mind too. Hopefully those papers will help you settle on your compilation model and the route you would like to take.

I can't say much about laziness, other than: read Simon Peyton Jones' actual full book (it's an amazing read):

  http://research.microsoft.com/en-us/um/people/simonpj/papers/slpj-book-1987/

That should help you concerning laziness, compilation, etc.

As for the FFI, I don't really have any papers on 'how to implement an FFI', but Matthias Blume's paper might be of interest:

  "No-Longer-Foreign: Teaching an ML compiler to speak C 'natively'" - http://ttic.uchicago.edu/~blume/papers/nlffi-entcs.pdf

libffi may also be worth investigating (small example below):

  http://sourceware.org/libffi/
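To make the Henderson approach concrete: the generated C keeps an explicit 'shadow stack' of frames recording where each function's live roots are, so the collector can find (and update) them precisely instead of scanning the machine stack conservatively. Here is a minimal sketch; 'Obj', 'GcFrame', 'gc_top' and 'example' are names I've invented for illustration, not anything from DDC or Mercury:

    #include <stddef.h>

    typedef struct Obj Obj;              /* an opaque heap object */

    /* One frame of the shadow stack: a link to the caller's frame,
       plus pointers to this function's local GC roots. */
    typedef struct GcFrame {
        struct GcFrame *prev;
        size_t          nroots;
        Obj           **roots[1];        /* morally variable-length */
    } GcFrame;

    static GcFrame *gc_top = NULL;       /* head of the chain */

    /* The collector finds every root by walking the chain, and can
       rewrite *slot if it moves objects. */
    static void gc_scan_roots(void (*visit)(Obj **slot))
    {
        for (GcFrame *f = gc_top; f != NULL; f = f->prev)
            for (size_t i = 0; i < f->nroots; i++)
                visit(f->roots[i]);
    }

    /* What the compiler would emit for a function with two live
       roots: push a frame on entry, pop it before every return. */
    Obj *example(Obj *x, Obj *y)
    {
        struct { GcFrame *prev; size_t n; Obj **roots[2]; } frame =
            { gc_top, 2, { &x, &y } };
        gc_top = (GcFrame *)&frame;      /* classic layout-punning trick */

        /* ... any allocation here may trigger a collection, which
           can rewrite x and y through the frame ... */

        gc_top = frame.prev;
        return x;
    }

The nice thing is that this needs nothing from the C compiler beyond taking addresses of locals, which is exactly why it works in an 'uncooperative environment'; the cost is the bookkeeping on every call.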
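And since I mentioned libffi: it lets you describe a C function's signature at run time and then call through it, which is handy when your compiler or interpreter doesn't want to generate a per-signature calling stub. A tiny self-contained example (this is the real libffi API; the program itself is just my illustration):

    /* Build a call descriptor for  int puts(const char *)  at run
       time, then call through it.  Compile with: cc demo.c -lffi */

    #include <stdio.h>
    #include <ffi.h>

    int main(void)
    {
        ffi_cif   cif;
        ffi_type *argtypes[1] = { &ffi_type_pointer };
        const char *msg       = "Hello from libffi";
        void      *argvals[1] = { &msg };
        ffi_arg    result;

        if (ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 1,
                         &ffi_type_sint, argtypes) == FFI_OK) {
            ffi_call(&cif, FFI_FN(puts), &result, argvals);
            printf("puts returned %d\n", (int)result);
        }
        return 0;
    }

The same ffi_prep_cif/ffi_call pair works for any signature you can describe with ffi_type values, which is most of what the call-out half of an FFI needs.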
> I have chosen the STG machine because I thought it would be easier to get an FFI and case exprs out of it. Maybe I am wrong, and template instantiation is really easier. There is also the BRISK machine, and GRIN. But the papers I read were less readable to me than the three mentioned.
How far into GRIN have you looked? It is one of the intermediate languages for LHC/JHC, and for a low-level intermediate representation it works quite well, I think. It is a strict, monadic intermediate language; being strict, we must still be able to model closures ('suspended applications', or thunks), so closures and partial applications are modelled in a sort of 'defunctionalized' way (making GRIN a sort of 'defunctionalized intermediate language', really) by turning them into something along the lines of an algebraic data type.

A good benefit of this is that you actually /don't/ have to write eval/apply yourself, because the compiler is supposed to generate them *for* you (see the P.S. below for a tiny sketch). In fact, this is a core part of GRIN: the generation of eval by the compiler is critical for many important optimizations to take place. It also makes the runtime a bit simpler. This comes at the price that GRIN is by design a whole-program intermediate form; in order to pull off any of these sophisticated transformations, everything must be in memory at once. But as we have seen with LHC/JHC, it can make a *huge* difference (apps that are 10x smaller *and* 10x faster than ones created by GHC).

Having dealt with GRIN while working on LHC (when time now permits...), I can say that it's a reasonable strategy and in practice it turns out pretty well, but if you're not into optimizing the code, then STG might be a better fit. I can't really give you much advice in this area; you'll just have to choose. But GRIN does have at least two excellent references:

  Code Optimisation Techniques for Lazy Functional Languages - U. Boquist, http://www.cs.chalmers.se/~boquist/phd/index.html

  The GRIN Project: A Highly Optimising Back End For Lazy Functional Languages - U. Boquist, http://www.cs.chalmers.se/~boquist/ifl96-abstract.html

Then again, you said you're not interested in optimization, so perhaps I'm just preaching to the totally wrong choir... :)

Austin
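P.S. If it helps make the 'defunctionalized eval' idea concrete, here's a rough C rendering of what a GRIN-style backend might generate for a toy program with a single known function add :: Int -> Int -> Int. All the tags and names are invented for illustration, and a real runtime would allocate nodes on its GC'd heap rather than via malloc:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Node Node;

    typedef enum {
        C_INT,       /* evaluated integer: a value */
        F_ADD,       /* suspended call  add arg0 arg1  (a thunk) */
        P_ADD_1      /* partial application of add, 1 arg missing */
    } Tag;

    struct Node { Tag tag; Node *arg0, *arg1; int val; };

    /* The eval the compiler generates: one case per tag in the
       whole program.  A forced thunk is overwritten with its
       result, so each suspension is computed at most once. */
    static Node *eval(Node *n)
    {
        switch (n->tag) {
        case C_INT:
        case P_ADD_1:                    /* values: nothing to do */
            return n;
        case F_ADD: {
            int r = eval(n->arg0)->val + eval(n->arg1)->val;
            n->tag = C_INT;              /* update the thunk in place */
            n->val = r;
            return n;
        }
        }
        return n;
    }

    /* The matching generated apply; in this toy program the only
       applicable value is P_ADD_1, so one case suffices. */
    static Node *apply(Node *f, Node *x)
    {
        Node *n = malloc(sizeof *n);
        n->tag  = F_ADD;
        n->arg0 = f->arg0;
        n->arg1 = x;
        return n;
    }

    int main(void)
    {
        Node two = { C_INT, NULL, NULL, 2 };
        Node inc = { P_ADD_1, &two, NULL, 0 };         /* add 2 _ */
        printf("%d\n", eval(apply(&inc, &two))->val);  /* prints 4 */
        return 0;
    }

Because eval is ordinary generated code over an ordinary data type, the optimizer can inline it, specialize the case on known tags, and so on, which is exactly where GRIN gets its leverage.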