Compilers: Why do we need a core language?

Hello, I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?

Hey, Here are some interesting links you that might help answer your question. http://www.aosabook.org/en/ghc.html http://www.youtube.com/watch?v=7NPBrWDzO2A Regards, José On 11/20/2012 12:54 PM, citb@lavabit.com wrote:
Hello,
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Le Tue, 20 Nov 2012 06:54:25 -0500 (EST), citb@lavabit.com a écrit :
Hello,
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
What would be the point in doing so?
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hi, Am Dienstag, den 20.11.2012, 06:54 -0500 schrieb citb@lavabit.com:
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
I would expect that even if you were to implement a compiler for full Haskell itself, you would end up generating a core language. E.g, where clauses can be implemented using let. So after you have implemented let you are likely to not copy the code but rather transform your where clause into a let clause. And alas, you suddenly have introduced a core language (Haskell without where clauses). This way you will end up with a very reduced subset of Haskell. Then you might also notice that some features are similar to implement and you can merge the code if you add some feature to your language, making it a bit more expressive. Then your core language will be more than just a subset of Haskell. As you can see, you will have a hard time preventing yourself from introducing a core language. Greetings, Joachim -- Joachim Breitner e-Mail: mail@joachim-breitner.de Homepage: http://www.joachim-breitner.de Jabber-ID: nomeata@joachim-breitner.de

On Tue, Nov 20, 2012 at 6:54 AM,
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
You might note that gcc does the same thing internally; its core seems to be inspired by S-expressions. (It's also somewhat lower level than GHC's core, but then C is itself lower level.) As for why: * It's a way to remove redundancy and simplify implementation. * Sometimes it's easier to rephrase a language feature in terms of another similar feature, instead of duplicating code or making that code sufficiently reusable to apply easily in multiple not-quite-identical places. * Sometimes it's because the language definition specifies a translation that describes the behavior of a language feature (see, for example, do blocks and automatically derived typeclasses in Haskell) and the simplest way to ensure compliance with the standard is to do the same translation in the compiler. * It can also be easier to apply high level optimization techniques; if you go straight from the highest level code to the lowest level, you are likely to miss optimization opportunities that are only revealed (or only sanely implementable) at intermediate levels. -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix/linux, openafs, kerberos, infrastructure http://sinenomine.net

Is it impossible (very hard) to directly translate high-level language into machine code?
There's a context to your question I don't understand, so let me ask:
Wouldn't it be easier to break a big step into smaller baby steps?
And if it's indeed easier why wouldn't you choose that route?
-- Kim-Ee
On Tue, Nov 20, 2012 at 6:54 PM,
Hello,
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 12-11-20 06:54 AM, citb@lavabit.com wrote:
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
Let the overall distance be fixed. More steps over that same distance means each step is smaller, easier to design, easier to understand, easier to correct if there are mistakes. Forget code generation. Just parse and validate and then discard. Already there are like 4 steps. Translate character sequence into token sequence. Translate token sequence into grammar tree, while checking grammar. Translate grammar tree into declaration groups, identifier lookup tables, etc., while checking whether every used identifier is declared or imported. Translate those groups and tables into type-annotated groups and tables, while checking types. Whew! After 4 steps of translating this to that, we still haven't reached the core language! Why? Because... have you ever tried to write a type-checker for character sequence? I'm sure some mad genius can do it, but I don't want to be that mad genius.

* It can also be easier to apply high level optimization techniques; if you go straight from the highest level code to the lowest level, you are likely to miss optimization opportunities that are only revealed (or only sanely implementable) at intermediate levels.
Not to mention that since optimizations are often expressed as replacements (note how stream fusions are implemented), optimizing after desugaring reduces the number of patterns an optimization needs to match against. By my understanding, GHC does not compile to core and then reparse the core file; 'core' is merely the name given to a certain stage in syntax tree manipulations. Thus, a compiler with no analogue of core would forego all syntax tree manipulations, instead taking the raw parse tree and running the code generator on that. And while that can be done for a sufficiently simple language, for a language with the complexity of Haskell I dare say that it is impossible:
Because... have you ever tried to write a type-checker for character sequence?
I'm sure some mad genius can do it, but I don't want to be that mad genius.
-- Ian Sturdy sturdyi12@mail.wlu.edu ________________________________________ From: haskell-cafe-bounces@haskell.org [haskell-cafe-bounces@haskell.org] on behalf of Albert Y. C. Lai [trebla@vex.net] Sent: Tuesday, November 20, 2012 12:47 PM To: haskell-cafe@haskell.org Subject: Re: [Haskell-cafe] Compilers: Why do we need a core language? On 12-11-20 06:54 AM, citb@lavabit.com wrote:
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
Let the overall distance be fixed. More steps over that same distance means each step is smaller, easier to design, easier to understand, easier to correct if there are mistakes. Forget code generation. Just parse and validate and then discard. Already there are like 4 steps. Translate character sequence into token sequence. Translate token sequence into grammar tree, while checking grammar. Translate grammar tree into declaration groups, identifier lookup tables, etc., while checking whether every used identifier is declared or imported. Translate those groups and tables into type-annotated groups and tables, while checking types. Whew! After 4 steps of translating this to that, we still haven't reached the core language! Why? Because... have you ever tried to write a type-checker for character sequence? I'm sure some mad genius can do it, but I don't want to be that mad genius. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 11/20/12 6:54 AM, citb@lavabit.com wrote:
Hello,
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
It is possible to remove stages in the standard compilation pipeline, and doing so can speed up compilation time. For example, Perl doesn't build an abstract syntax tree (for now-outdated performance reasons), and instead compiles the source language directly into bytecode (which is then interpreted by the runtime). This is one of the reasons why Perl is (or was?) so much faster than other interpreted languages like Python etc. But there are some big problems to beware of: * Not having a concrete representation for intermediate forms can rule out performing obvious optimizations. And I do mean *obvious* optimizations; I can talk more about this problem in Perl, if you really care. * Not having a concrete representation for intermediate forms means mixing together code from many different stages of the compilation process. This sort of spaghetti code is hard to maintain, and even harder to explain to new developers. * Not having a concrete representation for intermediate forms can lead to code duplication (in the compiler) because there's no convenient way to abstract over certain patterns. And, of course, repeating code is just begging for inconsistency bugs due to the maintenance burden of keeping all the copies in sync. All three points are major driving forces in having intermediate forms. Joachim Breitner gave some illustrations for why intermediate forms are "inevitable". But then, once you have intermediate forms, if you're interested in ensuring correctness and having a formal(izable) semantics, then it makes sense to try to turn those intermediate forms into an actual intermediate language. Intermediate forms are just an implementation detail, but intermediate languages can be reasoned about in the same ways as other languages. So it's more about shifting perspective in order to turn systems problems (implementation details) into language problems (semantics of the Core). Furthermore, if you're a PL person and really are trying to ensure correctness of your language (e.g., type safety), you want to try to make your proof obligation as small as possible. For convenience to programmers, source code is full of constructs which are all more or less equivalent. But this is a problem for making proofs because when we perform case analysis on an expression we have to deal with all those different syntactic forms. Whereas if you first compile everything down into a small core language, then the proof has far fewer syntactic forms it has to deal with and so the proof is much easier. Bear in mind that this isn't just a linear problem. If we have N different syntactic forms, then proving something like confluence will require proving O(N^2) cases since you're comparing two different terms. -- Live well, ~wren

On 24/11/2012, at 5:26 PM, wren ng thornton wrote:
On 11/20/12 6:54 AM, citb@lavabit.com wrote:
Hello,
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
It is possible to remove stages in the standard compilation pipeline, and doing so can speed up compilation time. For example, Perl doesn't build an abstract syntax tree (for now-outdated performance reasons), and instead compiles the source language directly into bytecode (which is then interpreted by the runtime). This is one of the reasons why Perl is (or was?) so much faster than other interpreted languages like Python etc.
I have found Perl anything from the same speed as AWK (reading and writing lots of data with hardly any processing) to 10 times slower than AWK (with respect to the 'mawk' implementation of AWK). The deeply saddening thing here is that there are lots of improvements I _know_ how to make to mawk to make it even faster, but the thing that stops me is the way mawk generates word-code directly without an AST. I don't know why Perl is direct-to-byte-code, but I'm pretty sure why mawk is direct-to-word-code and The One True Awk interprets its AST. AWK used to run on PDP-11s and for large source files had room for VM instructions or ASTs but not both at the same time. Designing a compiler for fast *compiling* leads to one kind of design; designing for fast *running* of generated code leads to another. And run times for scripting languages depends on things like the structure of hash tables, the quality of hashing functions, the cripplingly excessive richness of certain regular expression libraries, the memory management scheme, ...

On 11/25/12 11:08 PM, Richard O'Keefe wrote:
On 24/11/2012, at 5:26 PM, wren ng thornton wrote:
On 11/20/12 6:54 AM, citb@lavabit.com wrote:
Hello,
I know nothing about compilers and interpreters. I checked several books, but none of them explained why we have to translate a high-level language into a small (core) language. Is it impossible (very hard) to directly translate high-level language into machine code?
It is possible to remove stages in the standard compilation pipeline, and doing so can speed up compilation time. For example, Perl doesn't build an abstract syntax tree (for now-outdated performance reasons), and instead compiles the source language directly into bytecode (which is then interpreted by the runtime). This is one of the reasons why Perl is (or was?) so much faster than other interpreted languages like Python etc.
I have found Perl anything from the same speed as AWK (reading and writing lots of data with hardly any processing) to 10 times slower than AWK (with respect to the 'mawk' implementation of AWK).
Perhaps I was too glib in saying "other interpreted languages"; I certainly did not mean to include Sed and Awk (which I tend to consider shell scripting more than interpreted programming). Also, I'm only referring to the startup cost of parsing and compiling to bytecode[1], not any other overhead of the actual languages themselves. There are significant differences between the actual content of each language[2], but that's a linguistic issue rather than an issue of how to design and structure the compiler/interpreter. [1] For one-liners and short programs, this startup cost tends to dominate, which is why Perl does direct to bytecode. Perl was initially intended to be a combination/replacement for Sed, Awk, Bash, etc; it only turned into a full programming language later. That said, the overhead for starting the Perl runtime is much higher than for Sed and Awk, which is why many systems engineers still use Sed/Awk in their scripts. [2] A notorious but too poorly known example is the implementation of regexes where crufty old tools/languages like awk and grep vastly outperform modern tools like Perl and Python: http://swtch.com/~rsc/regexp/regexp1.html To say nothing of the implementations of strings, hash tables, etc; which you mentioned. -- Live well, ~wren
participants (10)
-
Albert Y. C. Lai
-
AUGER Cédric
-
Brandon Allbery
-
citb@lavabit.com
-
Joachim Breitner
-
José Lopes
-
Kim-Ee Yeoh
-
Richard O'Keefe
-
Sturdy, Ian
-
wren ng thornton