Questions concerning the LLVM backend - i.e. 'proc point splitting'

Hello everybody,

I've got some questions concerning the LLVM backend, specifically the pass 'proc point splitting' in the cmm-pipeline.

I'm new to this list as well as to computer science in general - at least in some sense. Being a third-semester student of computer science, I'm not quite sure whether I'm addressing the right list with this mail. Feel free to correct me if not.

I'm doing numerical stencil calculations which heavily depend on nested loops expressed as tail-recursive functions nested one in another. When compiling a module, GHC seems capable of emitting pretty good cmm-code, at least as long as the loop bodies do not contain any call to non-inlineable functions. In most cases 'loopification' fires and transforms tail-recursive calls into loops.

When calling into non-inlineable functions, 'proc point splitting' is performed. This splits the loops into several mutually recursive functions, thereby significantly decreasing the opportunities for 'opt' to optimize the code later on. This leads to pretty bad performance of the whole calculation.

To avoid the presence of proc points within a loop body - and by doing so the splitting of the body - I currently try to avoid calls into non-inlineable functions by merging the nested recursive functions into one giant tail-recursive function. This works as long as everything not defined in the module being compiled can be inlined. Obviously, this is not a long-term solution.

My questions:

- The reason to perform 'proc point splitting' is that LLVM IR lacks the possibility to call into an arbitrary basic block of a function body from outside the function itself. Is this understanding right?

- Are there any other possibilities to avoid 'proc point splitting' when using LLVM as a backend?

- I recently read the article on an improved LLVM backend in the GHC wiki (https://ghc.haskell.org/trac/ghc/wiki/ImprovedLLVMBackend), but couldn't find any suggestions on the topic of 'proc point splitting'. Are there any plans to solve the problem, either by teaching LLVM IR the lacking functionality or in any other way?

I'd really like to contribute to GHC by working on this problem, but I'm not quite sure if I'm capable of doing so. My background knowledge is still somewhat limited, but I'm working on it ;-)

Thanks in advance for your help!

Regards,

David Spitzenberg

PS.: Thanks to everyone for the great work you're doing on GHC!
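For concreteness, the difference described above looks roughly like this at the LLVM level. This is a minimal hand-written sketch in current LLVM syntax - the function names, the loop body and the helper are all invented, and GHC's calling convention and STG stack are left out - not code that GHC actually emits:

    ; A loop that stays inside one function: the back edge is visible,
    ; so LLVM's loop passes can work on it.
    define void @stencil_loop(i64 %n, ptr %arr) {
    entry:
      br label %loop
    loop:
      %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
      %p = getelementptr double, ptr %arr, i64 %i
      %x = load double, ptr %p
      %y = fadd double %x, 1.0              ; stand-in for the stencil body
      store double %y, ptr %p
      %i.next = add i64 %i, 1
      %done = icmp eq i64 %i.next, %n
      br i1 %done, label %exit, label %loop
    exit:
      ret void
    }

    ; The same loop after it has been split at a call proc-point: each
    ; half becomes a separate top-level function and the back edge turns
    ; into a tail call, so LLVM no longer sees a loop at all.
    declare void @expensive_helper(ptr)     ; non-inlineable, defined elsewhere

    define void @stencil_entry(i64 %i, i64 %n, ptr %arr) {
    entry:
      %done = icmp eq i64 %i, %n
      br i1 %done, label %exit, label %body
    body:
      call void @expensive_helper(ptr %arr)              ; out-of-line call
      tail call void @stencil_cont(i64 %i, i64 %n, ptr %arr)
      ret void
    exit:
      ret void
    }

    define void @stencil_cont(i64 %i, i64 %n, ptr %arr) {
    entry:
      %p = getelementptr double, ptr %arr, i64 %i
      %x = load double, ptr %p
      %y = fadd double %x, 1.0
      store double %y, ptr %p
      %i.next = add i64 %i, 1
      tail call void @stencil_entry(i64 %i.next, i64 %n, ptr %arr)
      ret void
    }

In GHC's real output the situation is worse still: the continuation is not called directly as above - its address is pushed onto the STG stack and the external function is tail-called - so the return edge is completely invisible to LLVM. The pseudo-code later in this thread shows that shape.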

David,

All this stuff is terribly paged-out for me. But I'd love someone to pay attention to some of this back-end stuff, so if you'd like to work there, I'd like to help you.

David Terei was also thinking of getting rid of proc point splitting for LLVM (see attached email and the notes attached to it).

Simon Marlow also was thinking about this; see his attached email.

The main *reason* for establishing a proc-point is so that, when generating C code (which we don't do any more) you can take its address. E.g.

   foo() {
      ...
      Push &bar onto the stack   (effectively a return address)
      Jump to thingumpy
   }

   bar() { ... }

Really bar is part of foo, but we have to make it a separate top-level thing so that we can take the address of bar and push it on the stack.

The trouble is that if there is a label inside bar that foo wants to jump to (i.e. without pushing &bar etc) then we have to make that label too into a proc-point, so that both foo and bar can see it. Urgh.

In LLVM this probably is not true; presumably you can take the address of any label?

Avoiding proc-point splitting would be GREAT.

Simon

Simon Peyton Jones writes:
David
All this stuff is terribly paged-out for me. But I'd love someone to pay attention to some of this back-end stuff, so if you'd like to work there, I'd like to help you.
David Terei was also thinking of getting rid of proc point splitting for LLVM (see attached email and the notes attached to it)
Simon Marlow also was thinking about this; see his attached email.
The main *reason* for establishing a proc-point is so that, when generating C code (which we don't do any more) you can take its address. E.g.
foo() {
   ...
   Push &bar onto the stack   (effectively a return address)
   Jump to thingumpy
}
bar() { ... }
Really bar is part of foo, but we have to make it a separate top-level thing so that we can take the address of bar and push it on the stack.
The trouble is that if there is a label inside bar that foo wants to jump to (i.e. without pushing &bar etc) then we have to make that label too into a proc-point, so that both foo and bar can see it. Urgh.
In LLVM this probably is not true; presumably you can take the address of any label?
This is true. While labels themselves have function-local scope in LLVM, there is an expression, `blockaddress`, which allows you to take the address of a label in another function [1]. That being said, in reading through the documentation it's not at all clear to me that it would be safe to jump to such an address. In fact, given that the instruction this expression is supposed to be used with, `indirectbr`, can only be used for local blocks, I suspect it is not. More information about this feature can be found here [2].

The jump issue aside, I don't know how you would deal with tables-next-to-code. The prefix data support that is currently available in LLVM [3] is attached to functions, and I unfortunately don't see that changing any time soon.

Ultimately it seems that trying to refer to labels defined in other functions is using LLVM against the way it was intended. One alternative would be to teach llvmGen to merge mutually recursive top-level functions into a single LLVM function during code generation. Otherwise I'm afraid you are stuck with either the status quo or attempting to improve on LLVM's own cross-procedure optimization ability.

That being said, it sounds as though eliminating proc-point splitting would make for quite a nice project in the native code generator.

Cheers,

- Ben

[1] http://llvm.org/docs/LangRef.html#addresses-of-basic-blocks
[2] http://blog.llvm.org/2010/01/address-of-label-and-indirect-branches.html
[3] http://llvm.org/docs/LangRef.html#prefix-data
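To make the two constraints above concrete, here is a small hand-written example in current LLVM syntax (all names invented). `indirectbr` may only list blocks of the function it appears in, so although a `blockaddress` constant may refer to a block of another function, there is no instruction that could branch to it; and prefix data attaches to a whole function, never to a label inside one:

    ; Legal: take block addresses and branch to them indirectly, as long
    ; as the destinations are blocks of this same function.
    define void @same_function(i1 %flag) {
    entry:
      br i1 %flag, label %pick_a, label %pick_b
    pick_a:
      indirectbr ptr blockaddress(@same_function, %a), [label %a, label %b]
    pick_b:
      indirectbr ptr blockaddress(@same_function, %b), [label %a, label %b]
    a:
      ret void
    b:
      ret void
    }

    ; A blockaddress naming a block of another function is a valid
    ; constant, but nothing in @other can actually transfer control to
    ; %a in @same_function, which is the problem described above.
    define ptr @other() {
    entry:
      ret ptr blockaddress(@same_function, %a)
    }

    ; Prefix data (an invented stand-in for an info table) attaches to a
    ; function as a whole; there is no way to attach it to a return
    ; point in the middle of a function.
    define void @closure_entry() prefix i64 42 {
    entry:
      ret void
    }

A continuation that lives as a return point inside another procedure can therefore neither be jumped to from outside nor given its own info table via prefix data, which is the combination of problems discussed in this thread.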

Hello Simon, hello Ben,

thank you very much to both of you for your encouraging answers!

First of all, I've two questions concerning your answer, Ben.

a)
The jump issue aside, I don't know how you would deal with tables-next-to-code. The prefix data support that is currently available in LLVM is attached to functions and I unfortunately don't see that changing any time soon.
Looks like I messed things up with those info-tables. As far as I understood [1], these are part of the STG. Therefore, every function present at the STG level needs an info-table. The code generator does generate info-tables for every proc-point after splitting them. But to me, this was a consequence of making them stand-alone functions.

Could you explain the need for further info-tables for 'inner' proc-points (those which are not the entry block of a function) to me, please?

b)
That being said, it sounds as though eliminating proc-point splitting would make for quite a nice project in the native code generator.
The documentation states that proc-point splitting is only required by the LLVM backend (see [2]). Does the cmm-pipeline perform proc-point splitting even when using the native code generator?

Second, some background information concerning my situation. As a student assistant, I'm part of a work group at university. They're doing research on loop optimization (which is the reason why I'm doing stencil calculations in Haskell). The group uses LLVM (which is the reason why I'm using the LLVM backend). Recently I talked to them, trying to explain the reasons for GHC emitting code containing mutual recursion instead of loops under certain circumstances. They told me that LLVM's cross-procedure optimization fails to merge the mutual recursion into a single loop because of the following: when pushing a continuation defined in the local LLVM module onto the stack and calling into a function defined externally, the analysis can't assure that control eventually returns to the local continuation.

To make control flow more apparent to LLVM, they came up with the following idea. A remark before showing the idea: I know that it greatly interferes with CPS. It leaves parts of the continuation handling to the implicit LLVM call stack. Further, this does not solve the problem of proc-point splitting at all; it is more a workaround than a solution. But it would greatly increase LLVM's possibilities to optimize code later on. That's why I found this idea worth mentioning, especially because, after reading your answer, Ben, I'm somewhat convinced that solving the problem of proc-point splitting is nearly impossible with the capabilities of LLVM IR available at the moment.

Now back to the idea. As stated above, the major problem of LLVM seems to be inferring control flow when it comes to call proc-points. As it's hard to avoid proc-point splitting, we leave the cmm-pipeline 'as-is'. We modify the code llvmGen produces for call proc-points instead. All other types of proc-points remain unchanged.

Initial situation in pseudo cmm:

    foo() {
        ...
        call bar() returns to cont    // bar defined externally

      cont:
        ...
    }

After proc-point splitting, this is converted into LLVM IR similar to the following:

    @foo() {
        ...
        store @cont(), %Sp    ; push continuation on the stack
        tail call @bar()      ; @bar defined externally
    }

    @cont() {
        ...
    }

To fix the above issue, we introduce a method which restores the pointers Sp and Hp and the register R1 (among everything else that has to be restored) and 'fakes' the continuation. Note that '@restore' returns without calling into the continuation. This way, the call into the continuation can be made direct. The above code would be transformed into something similar to the following:

    @foo() {
        ...
        store @restore(), %Sp    ; 'fake' continuation
        call @bar()              ; @bar defined externally
        br label restore

      restore:
        %Sp = load (getelementptr @environment 0, 0)    ; restore what has to be
        ...                                             ; restored
        tail call @cont()
    }

    @cont() {
        ...
    }

    @restore(%Sp, %Hp, %R1, ...) {
        store %Sp, (getelementptr @environment 0, 0)    ; store everything into
        ...                                             ; the global variable
        ret void
    }

    @environment    ; structure of a type similar to {i64*, i64*, i64*, i64, i64, ...}

As an alternative, the above transformation could be done by an LLVM pass. llvmGen could then annotate the continuation of each call at the call site. This would greatly simplify the detection of the correct continuation.

What do you think about this idea? Especially when thinking about the interference with CPS?
Regards,

David

[1] https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/GeneratedCode#Impo...
[2] https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CodeGen#Secondstag...
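One possible (purely invented) way to express the call-site annotation mentioned at the end of the message above is ordinary instruction metadata, which a custom LLVM pass could read back. The metadata kind `!ghc.cont` below is a made-up name for this sketch; nothing like it exists in llvmGen today:

    declare void @bar()

    define void @foo() {
    entry:
      ; ... push the real continuation onto the STG stack as usual ...
      ; the metadata merely tells a later pass which function is the
      ; continuation of this call
      tail call void @bar(), !ghc.cont !0
      ret void
    }

    define void @cont() {
    entry:
      ret void
    }

    !0 = !{ptr @cont}

A pass could then look the continuation up through the metadata instead of having to recover it from the store to %Sp.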

David Spitzenberg writes:
Hello Simon, hello Ben,
thank you very much to both of you for your encouraging answers!
Hello again!
First of all, I've two questions concerning your answer, Ben.
a)
The jump issue aside, I don't know how you would deal with tables-next-to-code. The prefix data support that is currently available in LLVM is attached to functions and I unfortunately don't see that changing any time soon.
Looks like I messed things up with those info-tables. As far as I understood [1], these are part of the STG. Therefore, every function present at the STG level needs an info-table. The code generator does generate info-tables for every proc-point after splitting them. But to me, this was a consequence of making them stand-alone functions.
Could you explain the need for further info-tables for 'inner' proc-points (those which are not the entry block of a function) to me, please?
Ahh, yes. I believe you are right; there should be no need for info tables on these proc-points given that they don't correspond to anything in the STG program. My apologies for muddying things!

[snip]
To make control flow more apparent to LLVM, they came up with the following idea. A remark before showing the idea: I know that it greatly interferes with CPS. It leaves parts of the continuation handling to the implicit LLVM call stack. Further, this does not solve the problem of proc-point splitting at all; it is more a workaround than a solution. But it would greatly increase LLVM's possibilities to optimize code later on. That's why I found this idea worth mentioning, especially because, after reading your answer, Ben, I'm somewhat convinced that solving the problem of proc-point splitting is nearly impossible with the capabilities of LLVM IR available at the moment.

Now back to the idea.
Well, don't let me dissuade you: firstly, I'm a rather fallible human and may have missed something; secondly, the LLVM folks have been quite receptive to new ideas and there is likely a reason they left the door open to expressing inter-procedure block references with `blockaddress`. You might consider asking about this on the LLVM list, assuming this isn't beyond the scope of your project.

I'll leave the rest for those with better understanding of the whole picture to address.

Cheers,

- Ben

David
| Could you explain the need for further info-tables for 'inner'
| proc-points (those which are not the entry block of a function) to
| me, please?
The main purpose of inner info tables is this. Given (case e of blah) we push a return address (for 'blah') on the stack and evaluate 'e'. During evaluation of 'e', garbage collection may take place. So the garbage collector needs to walk the stack to follow live pointers. Who knows what pointers are live? Answer: the code for 'blah' clearly knows what pointers are alive in the current stack frame, because it is going to access them. So the return address gets an info table that makes that implicit knowledge explicit, as a bitmap describing the liveness/pointer-hood of the stack frame.
Addresses that are only jumped to don’t need info tables.
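To give the shape of that information in concrete terms (this is only an invented illustration of what a return-address info table has to record, not GHC's actual info-table layout), it is essentially a constant sitting next to the code of the return point:

    ; Return point for 'blah': the GC finds the info table next to the
    ; code at the return address and reads the frame layout from it.
    ;   first field:  size of the stack frame in words
    ;   second field: bitmap, one bit per frame word, 1 = live pointer
    @blah_info = internal constant { i64, i64 } { i64 3, i64 5 }    ; 3 words; words 0 and 2 are pointers

This is per-return-point data, which is exactly what LLVM's function-level prefix data (mentioned by Ben earlier in the thread) cannot currently attach to a label in the middle of a function.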
It would be great if, as you learn about this stuff, you could add to the Commentary
https://ghc.haskell.org/trac/ghc/wiki/Commentary
to capture what you have learned.
| The documentation states that proc-point splitting is only required
| by the LLVM backend (see [2]). Does the cmm-pipeline perform proc-
| point splitting even when using the native code generator?
I'm not certain, but I don't think so.
| As stated above, the major problem of LLVM seems to be inferring
| control flow when it comes to call proc-points. As it's hard to avoid
| proc-point splitting, we leave the cmm-pipeline 'as-is'. We modify the
| code llvmGen produces for call proc-points instead.
I regard the whole proc-point thing as a major wart. I think we should work hard to eliminate it. So: another line of thought might be: how could we influence LLVM so that we didn't need proc-point splitting?
| To make control flow more apparent to LLVM, they came up with the
| following idea. A remark before showing the idea: I know that it
| greatly interferes with CPS. [snip]
I'm sorry to say that I didn't really understand the idea. Maybe someone else did? David Terei?
If you are seriously about investing effort in the back end, I'd be happy to help; in that case a Skype call is probably the best way.
Simon