A future for fusion
Dear Simon Peyton Jones and other interested GHC devs,

In a recent thread on the issue tracker of the vector package, I was asked for my thoughts on a specific improvement to their fusion system [1]. I replied that I was a bit disillusioned with GHC's approach to fusion because it is not reliable enough for the average user.

Simon pointed me to a GHC plugin by Harendra Kumar which acts as a lubricant for fusion, telling GHC to inline more than usual [2].

I suggested we could integrate this into GHC by giving an inlining discount to any function which contains a function that is also mentioned in a rewrite rule. However, I also noted that I think this will have a large impact on code size and compilation time in larger code bases. Simon agreed, but noted that Harendra's plugin is much more selective than what I suggested.

Since I think this discussion is not that relevant to the particular issue in the vector package, I'd like to continue it here.

To figure out what Harendra's plugin does, I skimmed the blog post and found this explanation [2]:
fusion-plugin gives programmers control through fusion annotations.
  {-# ANN type Step Fuse #-}
  data Step s a = Yield a s | Skip s | Stop
This tells the compiler:
Any binding that scrutinizes or constructs Step must be inlined, no matter its size.
The plugin scans bindings during the simplifier pass:
* If a fusible type (Step) is involved → mark it INLINE.
* Run another simplifier pass → constructors eliminated → fusion restored.
Translating this to the foldr/build fusion system in GHC today, we would force inlining of any function that calls `foldr` or `build` (or `augment`). Essentially, we would want to inline any binding that refers to a function that partakes in the rewrite rules for fusion. I'll admit we probably don't have to generalise this to all rewrite rules, but even with this restriction it seems like a rather aggressive optimisation strategy to me.

Please correct me if I have misinterpreted anything.

Cheers,
Jaro

[1] https://github.com/haskell/vector/issues/156#issuecomment-3623339336
[2] https://blog.composewell.com/versions/v2/posts/streamly-fusion-3.html
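[For context, a minimal sketch of the foldr/build machinery being discussed, simplified from the definitions in GHC.Base; the real versions carry additional pragmas and phase control:

  {-# LANGUAGE RankNTypes #-}
  module FoldrBuildSketch where

  import Prelude hiding (foldr, map)

  foldr :: (a -> b -> b) -> b -> [a] -> b
  foldr k z = go
    where
      go []     = z
      go (x:xs) = k x (go xs)

  -- 'build g' materialises a list from a producer abstracted over (:) and [].
  build :: (forall b. (a -> b -> b) -> b -> b) -> [a]
  build g = g (:) []

  -- The fusion rule: a foldr consuming a build-produced list can use the
  -- producer directly, so the intermediate list is never allocated.
  {-# RULES
  "foldr/build" forall k z (g :: forall b. (a -> b -> b) -> b -> b).
      foldr k z (build g) = g k z
    #-}

  -- For the rule to fire, producers must be inlined far enough to expose
  -- their build/foldr skeletons -- hence the INLINE pragma.
  map :: (a -> b) -> [a] -> [b]
  map f xs = build (\c n -> foldr (c . f) n xs)
  {-# INLINE map #-}

The point of the thread is precisely the last comment: the rule only fires if enough inlining has happened first.]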
it seems like a rather aggressive optimisation strategy to me.
It is indeed aggressive. Essentially, when you declare a data type (like
Step) to be fusible, you are declaring that you never want a single value of
type `Step` to be actually built at runtime -- they should all be fused
away. If a function returns a value of type `Step`, that value *can't* be
fused away unless we inline the function; so we have to inline it. Ditto if
it takes a value of type `Step` as an argument.
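[A sketch of the situation Simon describes, using a hypothetical minimal stream type modelled loosely on vector/streamly internals; all names here are illustrative:

  {-# LANGUAGE ExistentialQuantification #-}
  module StepSketch where

  data Step s a = Yield a s | Skip s | Stop
  data Stream a = forall s. Stream (s -> Step s a) s

  -- mapS builds Step values in its stepper. If mapS is not inlined into
  -- its consumer, those Yield/Skip/Stop constructors are allocated at
  -- runtime; once it is inlined, case-of-case cancels them against the
  -- consumer's pattern match.
  mapS :: (a -> b) -> Stream a -> Stream b
  mapS f (Stream step s0) = Stream step' s0
    where
      step' s = case step s of
        Yield x s' -> Yield (f x) s'
        Skip s'    -> Skip s'
        Stop       -> Stop
  {-# INLINE mapS #-}

  -- A consumer that scrutinises Step values; it too must be inlined for
  -- the constructors to disappear.
  foldlS :: (b -> a -> b) -> b -> Stream a -> b
  foldlS g z0 (Stream step s0) = go z0 s0
    where
      go z s = case step s of
        Yield x s' -> go (g z x) s'
        Skip s'    -> go z s'
        Stop       -> z
  {-# INLINE foldlS #-}
]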
But in Harendra's applications it works rather well apparently.
What is the goal you want to achieve? "Fuse where possible", perhaps? But
what is "possible"?
I have a feeling that in your library you do have a very clear idea of what
should be fused and what should not. What is that idea?
Simon
On Mon, 8 Dec 2025 at 18:05, Jaro Reinders wrote:
On Mon, 8 Dec 2025 at 23:35, Jaro Reinders wrote:
I suggested we could integrate this into GHC by giving an inlining discount to any function which contains a function that is also mentioned in a rewrite rule.
There are two distinct fusion mechanisms under consideration here: foldr/build fusion and stream fusion. Rewrite rules allow us to replace a combination of functions with an equivalent function, forming the basis of foldr/build fusion by fusing or eliminating intermediate functions. Stream fusion, on the other hand, eliminates intermediate constructors; the key optimizations for achieving this are case-of-case and worker-wrapper. Both types of fusion require inlining, but they are clearly different at the implementation level, even though the same high-level ideas may apply in both cases.

The fusion-plugin deals with stream fusion only -- inlining and spec-constr are used to expose constructors, which are then eliminated by case-of-case (for linear pipelines) or worker-wrapper (for recursive loops) transformations. fusion-plugin has been used extensively in streamly, and we have been able to fuse almost all the cases that we wanted to. I would recommend that you try out fusion-plugin on the vector package; it can help fuse the pesky cases, and you can also gain some insight into how much code bloat it actually causes in practice. Vector is primarily based on stream fusion, and even though it uses rewrite rules, those are not for fusion as such, I believe.
However, I also noted I think this will have a large impact on code size and compilation time in larger code bases. Simon agreed, but noted that Harendra's plugin is much more selective than what I suggested.
Typically we have a pipeline of functions connected by fusible constructors. To fuse the entire pipeline completely, we need to inline all the functions involved in the pipeline. Thus, instead of having multiple small functions, we end up creating one large combined function. However, when we use the case-of-case transformation on the combined function, the size often gets reduced dramatically. In fact, the total size of the end product is usually smaller than the combined original components of the pipeline. However, GHC requires more memory and CPU to simplify, especially if the loop becomes too large.

In our experience, aggressive inlining for fusing linear segments of pipelines, even if they are pretty big, has almost never been a problem. If a segment becomes too large, we can always break the pipeline by using a NOINLINE, trading runtime performance for compile-time resource requirements, but we have rarely needed to do that. The code bloat comes more often from the spec-constr optimization when we are fusing recursive loops. In some cases we have used -fspec-constr-recursive=16, which I admit is a very large value; compile times can sometimes increase dramatically due to this. We can always choose to use a lower value, which may result in unfused code in some cases. There may be scope for improvement in GHC to make this optimization smarter and more selective.

We can also put some hard thresholds or heuristics here for the cases where the resulting monolithic functions become too large. One possible way to decide whether to inline or not could be to check whether the size of the combined/optimized function is smaller than, or roughly equivalent to, the sum of the individual parts. If inlining does not result in elimination of the constructors in subsequent passes, then the inlining is not useful. But we have seen that it is almost always possible to eliminate all or most of the constructors.
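[A hand-worked sketch, not actual GHC output, of the case-of-case transformation mentioned above, on a Step type cut down to two constructors for brevity:

  module CaseOfCaseSketch where

  data Step s a = Yield a s | Stop

  -- Before: after inlining the producer, the consumer scrutinises a Step
  -- that the inner case has just built.
  before :: (s -> Step s Int) -> (Int -> Int) -> s -> Int
  before step f s =
    case (case step s of
            Yield x s' -> Yield (f x) s'
            Stop       -> Stop) of
      Yield y _ -> y
      Stop      -> 0

  -- After: the outer case is pushed into each branch of the inner case,
  -- and the freshly built Yield/Stop cancel against the pattern match,
  -- so no Step constructor survives to runtime.
  after :: (s -> Step s Int) -> (Int -> Int) -> s -> Int
  after step f s =
    case step s of
      Yield x _ -> f x
      Stop      -> 0

This cancellation is why the fused function is often smaller than the sum of its parts, as described above.]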
Translating this to the foldr/build fusion system in GHC today, we would force inlining of any function that calls `foldr` or `build` (or `augment`). Essentially, we would want to inline any binding that refers to a function that partakes in the rewrite rules for fusion. I'll admit we probably don't have to generalise this to all rewrite rules, but even with this restriction it seems like a rather aggressive optimisation strategy to me.
I do not have much experience with using foldr/build. But conceptually this is quite similar to what we do in stream fusion, and similar ideas should be applicable here as well. In this case we inline to expose the foldr/build primitives and then use rewrite rules to combine those. If the inlining does not result in further elimination or firing of rewrite rules, then the inlining is not useful. You can probably try out automatic inlining in the same way as we do in fusion-plugin and see if code bloat/compilation times become a real issue, and in which particular cases, then put heuristics/thresholds in place to avoid that.

Is there a driving use case for extending this to foldr/build?

-harendra
There are two distinct fusion mechanisms under consideration here - foldr/build fusion and stream fusion. Rewrite rules allow us to replace a combination of functions with an equivalent function, forming the basis of foldr/build fusion by fusing or eliminating intermediate functions. Stream fusion on the other hand eliminates intermediate constructors
I'm not really getting this distinction. foldr/build fusion eliminates
intermediate lists (cons cells); stream fusion eliminates intermediate Step
constructors. What is the difference, really?
The only difference in my mind is that you can declare the entire Step data
type to be fusible, thereby driving aggressive inlining. Step is *only* an
intermediate data type. But we can't do that for lists because they are
ubiquitous.
BTW Harendra, what happens if you have sharing
let ys :: Stream Int = h x
in f ys + g ys
where one Stream is consumed by two consumers, f and g. Now it's not so
easy to eliminate the intermediate, and aggressively inlining f, g, and h
will simply increase code size.
Simon
On Tue, 9 Dec 2025 at 00:03, Harendra Kumar wrote:
BTW Harendra, what happens if you have sharing
Typically, using a stream implies you *really* want constant space use over sharing. So the stream should just be inlined twice, at the cost of code size.

A colleague from WT wrote a whole blog post on this topic a few years back: https://www.well-typed.com/blog/2016/09/sharing-conduit/

On 09/12/2025 10:48, Simon Peyton Jones wrote:
_______________________________________________
ghc-devs mailing list -- ghc-devs@haskell.org
To unsubscribe send an email to ghc-devs-leave@haskell.org
On Tue, 9 Dec 2025 at 15:19, Simon Peyton Jones wrote:
I'm not really getting this distinction. foldr/build fusion eliminates intermediate lists (cons cells); stream fusion eliminates intermediate Step constructors. What is the difference, really?
My point was that these two fusion approaches have very different implementations: foldr/build relies on identifying the respective functions and their rewrite rules, while stream fusion relies on recognizing the constructors involved. It wasn't clear which one we were discussing. Jaro was coming from the point of view of the vector library, where fusion is based on the Step constructor, and then he mentioned rewrite rules, which don't apply to stream fusion. So I'm unsure of the goal: is it improving list fusion, or making vector-style stream fusion more reliable, possibly by integrating fusion-plugin ideas into GHC?
The only difference in my mind is that you can declare the entire Step data type to be fusible, thereby driving aggressive inlining. Step is *only* an intermediate data type. But we can't do that for lists because they are ubiquitous.
In the list fusion case, the constructor-marking scheme will not apply. Here we would have to identify whether a function results in foldr/build primitives and whether a rewrite rule exists that can eliminate them; if so, we can mark the function with INLINE. So the corresponding concept would be to mark and find the foldr/build functions rather than the list constructors. However, this is already achieved by manually marking all such functions to be inlined, though it may be fragile.

In the case of stream fusion, the programmer also marks the involved functions with INLINE, but the core problem that the fusion-plugin solves is that the internal join points/let bindings created by GHC during core-to-core transformations are not marked inline, so the programmer loses control over inlining. I am not sure whether such a problem even exists in the case of list fusion.
BTW Harendra, what happens if you have sharing
let ys :: Stream Int = h x in f ys + g ys
where one Stream is consumed by two consumers, f and g. Now it's not so easy to eliminate the intermediate, and aggressively inlining f, g, and h will simply increase code size.
First of all, I would discourage writing code which duplicates the stream. The recommended approach is to use fold combinators to combine the folds "f" and "g" here; this is achieved by using the "tee" or "teeWith" combinators in streamly. The resulting fold consumes a single stream and distributes it to both folds, achieving complete fusion.

If someone does write this code, in the case of stream fusion both f and g will get inlined and fuse with "ys", so fusion is not an issue. However, "ys" gets duplicated, which has its own ramifications in impure code, but that's a different issue altogether.

-harendra
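[A sketch of the recommended shape, assuming streamly's Fold.tee combinator, which runs two folds over one stream in a single fused pass:

  module TeeSketch where

  import qualified Streamly.Data.Stream as Stream
  import qualified Streamly.Data.Fold as Fold

  -- One pass over the stream; both results computed together, so the
  -- stream is neither shared nor recomputed.
  sumAndLength :: IO (Int, Int)
  sumAndLength =
    Stream.fold (Fold.tee Fold.sum Fold.length)
                (Stream.enumerateFromTo 1 (10 :: Int))
]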
foldr/build relies on identifying the respective functions and their rewrite rules, while stream fusion relies on recognizing the constructors involved.
Ah, I think this is what I was missing. Stream fusion does not use RULEs at all. (Except for the (stream . unstream = id) rule, which is needed only in the conversion to/from regular lists, which you probably avoid.)

But still, I think the key features of stream fusion and foldr/build fusion are the same: *we must do enough inlining*. INLINE pragmas may help; but join points are a problem because they are introduced by the compiler. Your plugin uses a criterion like "if the function consumes or returns a fusible type, inline it", although I'm terribly hazy about the precise specification.

That same plugin might well be useful for other RULE-based libraries too. E.g. whether `map` returns explicit constructors (like (:) and []) or pseudo-constructors like (build g) doesn't matter. What matters is that the function is inlined.

Two other thoughts.

First: inlining join points aggressively can cause exponential code blow-up. Consider

  join j1 x = blah
  join j2 x = case x of { Left x1 -> j1 x1; Right x2 -> j1 (x2+1) }
  join j3 x = case x of { Left x1 -> j2 x1; Right x2 -> j2 (x2+1) }
  join j4 x = case x of { Left x1 -> j3 x1; Right x2 -> j3 (x2+1) }
  in ...

The types aren't right here, but you'll get the idea. If you inline all the join points, you unravel a DAG into a tree of exponential size. I'm not sure how you guard against that.

Second: inlining join points may not be necessary to achieve fusion. Section 5 of "Compiling without continuations" (https://simon.peytonjones.org/compiling-without-continuations/) gives the idea. It might be worth understanding carefully why this doesn't work in your application.

Third:
If someone does write this code, in the case of stream fusion both f and g will get inlined and fuse with "ys", so fusion is not an issue. However, "ys" gets duplicated
In the example ys = h x
So are you saying that (h x) is computed twice? That would cause a
perhaps-asymptotic increase in run time! Is that what you get? You
could say "don't do that", but it's hard to be completely sure you haven't.
Simon
On Thu, 11 Dec 2025 at 08:32, Harendra Kumar wrote:
On Thu, 11 Dec 2025 at 14:39, Simon Peyton Jones wrote:
But still, I think the key features of stream fusion and foldr/build fusion are the same: we must do enough inlining. INLINE pragmas may help; but join points are a problem because they are introduced by the compiler.
I agree inlining is key in both cases; inlining the right functions, and no more, is the goal.
Your plugin uses a criterion like "if the function consumes or returns a fusible type, inline it", although I'm terribly hazy about the precise specification.
That is correct at a high level; the rest are details. Our ultimate goal is to have no fusible constructors at all in the final core.
That same plugin might well be useful for other RULE-based libraries too. E.g. whether `map` returns explicit constructors (like (:) and []) or pseudo-constructors like (build g) doesn't matter. What matters is that the function is inlined.
That is possible. The first step would be to detect such cases and warn the user about the presence of fusible constructors, and therefore a missed fusion opportunity.
Two other thoughts
First: inlining join points aggressively can cause exponential code blow-up. Consider

  join j1 x = blah
  join j2 x = case x of { Left x1 -> j1 x1; Right x2 -> j1 (x2+1) }
  join j3 x = case x of { Left x1 -> j2 x1; Right x2 -> j2 (x2+1) }
  join j4 x = case x of { Left x1 -> j3 x1; Right x2 -> j3 (x2+1) }
  in ...

The types aren't right here but you'll get the idea. If you inline all the join points, you unravel a DAG into a tree of exponential size.
I'm not sure how you guard against that.
One way is to put an upper limit on the size of the resulting function -- if it exceeds the limit, then do not inline, and instead warn the user that fusion would result in a large intermediate function. The user can still force it if they know what they are doing, by relaxing the limit.
Second. Inlining join points may not be necessary to achieve fusion. Section 5 of "Compiling without continuations" gives the idea. It might be worth understanding carefully why this doesn't work in your application.
It's a bit surprising to me that that is possible. I will try to understand it.
Third:
If someone does write this code, in the case of stream fusion both f and g will get inlined and fuse with "ys", so fusion is not an issue. However, "ys" gets duplicated
In the example, ys = h x. So are you saying that (h x) is computed twice? That would cause a perhaps-asymptotic increase in run time! Is that what you get? You could say "don't do that", but it's hard to be completely sure you haven't.
Yes, "h x" is a stream, and both f and g are independently consuming and fusing it, generating two different fused loops end to end. For monadic streams in streamly the equivalent code looks like this:

    module Example (f) where

    import qualified Streamly.Data.Stream as Stream
    import qualified Streamly.Data.Fold as Fold

    f :: IO Int
    f = do
      let ys = Stream.enumerateFromTo 1 (10 :: Int)
      x <- Stream.fold Fold.sum ys
      y <- Stream.fold Fold.length ys
      return (x + y)

And the fused core for the above code looks like this:

    Rec {
    f_$s$wgo = \ sc_s2KE sc1_s2KF ww_s2IJ eta_s2IL ->
      case <=# ww_s2IJ 10# of {
        __DEFAULT ->
          joinrec {
            $s$wgo_s2KM sc2_s2KL sc3_s2KK sc4_s2KJ =
              case <=# sc3_s2KK 10# of {
                __DEFAULT -> (# sc2_s2KL, I# (+# sc_s2KE sc4_s2KJ) #);
                1# -> jump $s$wgo_s2KM sc2_s2KL (+# sc3_s2KK 1#) (+# sc4_s2KJ 1#)
              };
          } in jump $s$wgo_s2KM eta_s2IL 1# 0#;
        1# ->
          let { incr_s1Xd = -# ww_s2IJ sc1_s2KF } in
          let { total1_s1Xe = +# sc_s2KE incr_s1Xd } in
          f_$s$wgo total1_s1Xe (-# (-# total1_s1Xe sc_s2KE) incr_s1Xd)
                   (+# ww_s2IJ 1#) eta_s2IL
      }
    end Rec }

    f1 = \ s_a1u0 -> f_$s$wgo 0# 0# 1# s_a1u0

    f = f1 `cast` Co:3 :: ...

We can see that two different recursive loops are generated, each with 10 iterations: one corresponds to "sum" and the other to "length". So yes, "h x" is independently evaluated in both cases.

-harendra
So if `h x` is `map expensive x`, and x is the stream x1,x2,x3, ... then
we'll compute (expensive x1) twice and similarly (expensive x2).
GHC is usually pathologically averse to duplicating expensive computations
like this; I'm not sure how you torture it into doing so!
Simon
I think a solution based on multi-stage programming is the most natural way to tackle the stream fusion problem in a robust manner. For example: https://andraskovacs.github.io/pdfs/2ltt_icfp24.pdf

By allowing a programmer to express which parts of the program should only exist at code-generation time, you guarantee that fusion occurs. This seems much more aligned with how Haskell programmers think about other things.
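In today's GHC one can approximate part of that guarantee without staging machinery by choosing a representation in which the intermediate structure never exists at run time; a toy sketch, with all names hypothetical (this is not code from the linked paper):

```haskell
{-# LANGUAGE RankNTypes #-}

-- A push stream is *code for a loop*, not data: consuming it runs the
-- composed producer/transformer pipeline directly, so there is no
-- intermediate structure for the optimiser to remove -- fusion holds
-- by construction.
newtype Push a = Push (forall r. (a -> r -> r) -> r -> r)

prange :: Int -> Int -> Push Int
prange lo hi = Push (\cons nil ->
  let go i | i > hi    = nil
           | otherwise = cons i (go (i + 1))
  in go lo)

-- Composes by function composition; no constructors are allocated.
pmap :: (a -> b) -> Push a -> Push b
pmap f (Push g) = Push (\cons nil -> g (cons . f) nil)

psum :: Push Int -> Int
psum (Push g) = g (+) 0

main :: IO ()
main = print (psum (pmap (* 2) (prange 1 10)))  -- prints 110
```

Full staging (e.g. Typed Template Haskell, or the two-level type theory in the linked paper) goes further: it also guarantees that the closures above are eliminated, rather than hoping the simplifier does so.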
Cheers,
Matt
On Tue, Dec 9, 2025 at 12:03 AM Harendra Kumar wrote:

On Mon, 8 Dec 2025 at 23:35, Jaro Reinders wrote:
I suggested we could integrate this into GHC by giving an inlining discount to any function which contains a function that is also mentioned in a rewrite rule.
There are two distinct fusion mechanisms under consideration here: foldr/build fusion and stream fusion. Rewrite rules allow us to replace a combination of functions with an equivalent function, forming the basis of foldr/build fusion by fusing or eliminating intermediate functions. Stream fusion, on the other hand, eliminates intermediate constructors; the key optimizations for achieving this are case-of-case and worker-wrapper. Both types of fusion require inlining, but they are clearly different at the implementation level, even though the same high-level ideas may apply in both cases.
The fusion-plugin deals with stream fusion only -- inlining and spec-constr are used to expose constructors, which are then eliminated by case-of-case (for linear pipelines) or worker-wrapper (for recursive loops) transformations. fusion-plugin has been used extensively in streamly, and we have been able to fuse almost all cases that we wanted to. I would recommend that you try out fusion-plugin on the vector package; it can help fuse the pesky cases, and you can also gain some insight into how much code bloat it actually causes in practice. Vector is primarily based on stream fusion, and even though it uses rewrite rules, I believe those are not for fusion as such.
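For readers following along, here is a toy version of the Stream/Step encoding Harendra describes, simplified from the shape used by vector/streamly (all names here are hypothetical). The `Yield`/`Skip`/`Stop` constructors scrutinised in `smap` are exactly what case-of-case is meant to eliminate once everything is inlined:

```haskell
{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE BangPatterns #-}

-- One step of a non-recursive state machine.
data Step s a = Yield a s | Skip s | Stop

-- A stream is a step function plus an initial state; the state type is
-- existential so combinators can layer their own state on top.
data Stream a = forall s. Stream (s -> Step s a) s

senum :: Int -> Int -> Stream Int
senum lo hi = Stream step lo
  where
    step i | i > hi    = Stop
           | otherwise = Yield i (i + 1)

-- Non-recursive: it merely wraps the inner step function. Inlining
-- 'smap' and 'senum' exposes Step constructors for case-of-case to
-- eliminate; if they are not inlined, Step values survive to run time.
smap :: (a -> b) -> Stream a -> Stream b
smap f (Stream step s0) = Stream step' s0
  where
    step' s = case step s of
      Yield a s' -> Yield (f a) s'
      Skip s'    -> Skip s'
      Stop       -> Stop

-- The single recursive consumer at the end of the pipeline.
ssum :: Stream Int -> Int
ssum (Stream step s0) = go 0 s0
  where
    go !acc s = case step s of
      Yield a s' -> go (acc + a) s'
      Skip s'    -> go acc s'
      Stop       -> acc

main :: IO ()
main = print (ssum (smap (* 2) (senum 1 10)))  -- prints 110
```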
However, I also noted I think this will have a large impact on code size and compilation time in larger code bases. Simon agreed, but noted that Harendra's plugin is much more selective than what I suggested.
Typically we have a pipeline of functions connected by fusible constructors. To fuse the entire pipeline completely, we need to inline all the functions involved in it. Thus, instead of having multiple small functions, we end up creating one large combined function. However, when we apply the case-of-case transformation to the combined function, the size often gets reduced dramatically. In fact, the total size of the end product is usually smaller than the combined original components of the pipeline. However, GHC requires more memory and CPU to simplify, especially if the loop becomes too large.
In our experience, aggressive inlining for fusing linear segments of pipelines has almost never been a problem, even when they are pretty big. If a segment becomes large, we can always break the pipeline with a NOINLINE, trading runtime performance for compile-time resource requirements, but we have rarely needed to do that. The code bloat comes more often from the spec-constr optimization when we are fusing recursive loops. In some cases we have used -fspec-constr-recursive=16, which I admit is a very large value, and the compile times can sometimes increase dramatically because of it. We can always choose a lower value, which may result in unfused code in some cases. There may be scope for improvement in GHC to make this optimization smarter and more selective.
We can also put some hard thresholds or heuristics here for the cases when the resulting monolithic functions become too large. One possible way to decide whether to inline or not could be to check if the size of the combined/optimized function is smaller than or roughly equivalent to the sum of the individual parts. If inlining does not result in elimination of the constructors in subsequent passes, then the inlining is not useful. But we have seen that it is almost always possible to eliminate all or most of the constructors.
Translating this to the fold/build fusion system in GHC today, we would force inlining of any function that calls `foldr` or `build` (or `augment`). Essentially, we would want to inline any binding that refers to a function that partakes in the rewrite rules for fusion. I'll admit we probably don't have to generalise this to all rewrite rules, but even with this restriction it seems like a rather aggressive optimisation strategy to me.
I do not have much experience with using foldr/build. But conceptually it is quite similar to what we do in stream fusion, and similar ideas should be applicable here as well. In this case we inline to expose the foldr/build primitives and then use rewrite rules to combine those. If the inlining does not result in further elimination or firing of rewrite rules, then the inlining is not useful. You can probably try out automatic inlining in the same way as we do in fusion-plugin and see whether code bloat or compilation times become a real issue, and in which particular cases; then put heuristics/thresholds in place to avoid that. Is there a driving use case for extending this to foldr/build?
-harendra _______________________________________________ ghc-devs mailing list -- ghc-devs@haskell.org To unsubscribe send an email to ghc-devs-leave@haskell.org
Hey all,

I'm not really a regularly contributing GHC dev, so I might be speaking without enough context, but I want to express that I agree with Matt here. The main problem with fusion, IMO, is that there is no easy way to determine whether it has happened to the level that the programmer actually wants it to happen.

Hence, I think the main thing to solve here is to give the programmer the ability to easily and natively say "don't compile at all if this doesn't fuse/specialise", or alternatively "the only way to compile this is if it fuses/specialises". I believe this is the approach taken by languages/systems which heavily focus on specialisation, like AnyDSL/Thorin/Impala, and I think it is somewhat successful, although still quite experimental.

Cheers,
Georgi

On 12/9/25 11:53, Matthew Pickering wrote:
I think a solution based on multi-stage programming is the most natural way to tackle the stream fusion problem in a robust manner.
How close are we to being able to do that in Haskell?
Simon
I agree that multi-stage programming is the principled green-field solution. It provides the right model in which to think about solutions for GHC. For GHC, that model certainly means "Step is just an intermediate data structure", i.e., any computation involving `Step` should only live at compile time ("template polymorphism") and it should be inlined/specialised for aggressively.

This is not too different from the situation with list fusion. Any compiled program that *runs* `map`/`filter`/`foldr`/`build` has a performance bug. You always want to end up with `foldr`'s SAT'd unfolding/explicit `(:)`/`[]` instead of a higher-order function `build` which does the same thing. That is achieved by the intricate multi-phase rewrite rule framework. If all goes well, the final program never contains any lists whatsoever; only a nesting of tight recursive loops. Note that there is no code sharing of `map`/`filter`/`foldr` here; they must all have been inlined and rewritten and were purely a compile-time thing.

We want the same for stream fusion because it has better fusion support for `zipWith`. Inlining any function involving `Step` is absolutely the right thing to do! Aggressive specialization as well, but that's really difficult to do using SpecConstr because it blows up so easily. The stream fusion paper https://www.cs.tufts.edu/~nr/cs257/archive/duncan-coutts/stream-fusion.pdf instead suggests applying the static argument transformation in section 7.2. A draft patch of mine tried to do that on the fly, to compute unfoldings for recursive functions with static args: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4553. It lacks polishing and my time and interest, but I firmly believe this should help the stream fusion use case, because SAT+inlining is *exactly* specialization of recursive functions for static arguments.
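To make the SAT connection concrete, a minimal sketch (names are illustrative): in `mapPlain`, the argument `f` is passed unchanged through every recursive call (it is "static"), so the recursive definition is not inlined at call sites. After the static argument transformation, the recursion lives in a local worker that closes over `f`, and the now non-recursive wrapper can be inlined, effectively specialising the loop for each `f`:

```haskell
-- Before SAT: 'f' is a static argument of the recursion, so the whole
-- recursive definition resists inlining.
mapPlain :: (a -> b) -> [a] -> [b]
mapPlain _ []       = []
mapPlain f (x : xs) = f x : mapPlain f xs

-- After SAT: the recursive worker 'go' closes over 'f'; the wrapper is
-- non-recursive and therefore inlinable, so each call site can get its
-- own copy of the loop with 'f' known.
mapSat :: (a -> b) -> [a] -> [b]
mapSat f = go
  where
    go []       = []
    go (x : xs) = f x : go xs

main :: IO ()
main = print (mapSat (+ 1) [1, 2, 3 :: Int])  -- prints [2,3,4]
```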
(Side note: Andrasz pointed out to me two years ago that there were further complications, which I don't recall, that made him settle on the practical compromise of going for foldr/build fusion while enabling zip fusion/parallel `for` with stream fusion.)

---
what happens if you have sharing

    let ys :: Stream Int = h x
    in f ys + g ys
This is a flaw in the way that GHC says "lists are streams". IMO a stream (such as in streamly) should happily duplicate all the work that goes into computing the elements of `ys`, since it is just code. By contrast, data structures such as lists do not have that luxury. `let ys :: [Int] = ...; in f ys ++ g ys` should not try to fuse `ys`. Here's a repo that implements just foldr/build fusion without rewriting lists to/from it: https://github.com/sgraf812/foldr-build. It doesn't have the "fusion duplicates work" flaw because there is no data that should have shared the work in the first place.

---

*TLDR;* inlining and specialising `Step` sounds like exactly what is needed. Specialisation of concatMap is the hard part for stream fusion, though, and that means we'll end up with `Step` in the program nonetheless.

Btw., Lean's standard `Iterator` framework https://github.com/leanprover/lean4/blob/6e711bf067a0d9dc6623996b4bd3757ac6c... (inspired by Rust's https://doc.rust-lang.org/std/iter/trait.Iterator.html) will implement "stream fusion". (That's what external iteration is.) I'm not saying that it's simple or that the optimiser is equipped to handle it, though :)

Cheers,
Sebastian

On Tue, 9 Dec 2025 at 13:45, Simon Peyton Jones <simon.peytonjones@gmail.com> wrote:
(Perhaps I should have said "returning `Step`" instead of "involving `Step`".)

On Tue, 9 Dec 2025 at 14:26, Sebastian Graf <sgraf1337@gmail.com> wrote:
I agree that multi-stage programming is the principled green-field solution. It provides the right model in which to think about solutions for GHC. For GHC, that model certainly means "Step is just an intermediate data structure", i.e., any computation involving `Step` should live only at compile time ("template polymorphism") and should be aggressively inlined and specialised.

This is not too different from the situation with list fusion. Any compiled program that *runs* `map`/`filter`/`foldr`/`build` has a performance bug. You always want to end up with `foldr`'s SAT'd unfolding and explicit `(:)`/`[]` instead of a higher-order function `build` which does the same thing. That is achieved by the intricate multi-phase rewrite-rule framework. If all goes well, the final program contains no lists whatsoever; only a nesting of tight recursive loops. Note that there is no code sharing of `map`/`filter`/`foldr` here; they must all have been inlined and rewritten, and were purely a compile-time thing.
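For concreteness, here is a standalone toy sketch of the fold/build rule (primed names of my own; in the spirit of GHC.Base, but not GHC's actual code):

```haskell
{-# LANGUAGE RankNTypes #-}
-- Toy sketch of foldr/build fusion; standalone, not GHC's real code.

foldr' :: (a -> b -> b) -> b -> [a] -> b
foldr' k z = go
  where
    go []     = z
    go (x:xs) = k x (go xs)

-- `build' g` materialises a list from its Church encoding.
build' :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build' g = g (:) []

-- GHC's rule:  foldr k z (build g)  ==>  g k z
-- The right-hand side never allocates a list:
sumNoList :: (forall b. (Int -> b -> b) -> b -> b) -> Int
sumNoList g = g (+) 0

main :: IO ()
main = do
  -- the "list" [1,2,3] exists only as code, a Church encoding
  print (foldr' (+) 0 (build' (\c n -> c 1 (c 2 (c 3 n)))))  -- 6
  print (sumNoList (\c n -> c 1 (c 2 (c 3 n))))              -- 6, list-free
```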
We want the same for stream fusion, because it has better fusion support for `zipWith`. Inlining any function involving `Step` is absolutely the right thing to do! Aggressive specialisation as well, but that is really difficult to do using SpecConstr because it blows up so easily. The stream fusion paper https://www.cs.tufts.edu/~nr/cs257/archive/duncan-coutts/stream-fusion.pdf instead suggests applying the static argument transformation in section 7.2. A draft patch of mine tried to do that on the fly, computing unfoldings for recursive functions with static arguments: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4553. It lacks polish, and my time and interest, but I firmly believe it should help the stream fusion use case, because SAT+inlining is *exactly* specialisation of recursive functions for static arguments. (Side note: András pointed out to me two years ago that there were further complications, which I don't recall, and which mean that he settled on the practical compromise of going for foldr/build fusion and enabling zip fusion/parallel-for with stream fusion.)
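(A toy illustration of the static argument transformation, my own example rather than code from the patch: hoisting the static argument out of the recursion is what lets inlining specialise the loop.)

```haskell
-- Before SAT: `f` is passed through every recursive call, so the
-- recursive function cannot be specialised for a given `f` by
-- mere inlining.
mapRec :: (a -> b) -> [a] -> [b]
mapRec _ []     = []
mapRec f (x:xs) = f x : mapRec f xs

-- After SAT: `f` is captured once; the small non-recursive wrapper
-- `mapSAT` inlines at call sites, specialising `go` for that `f`.
mapSAT :: (a -> b) -> [a] -> [b]
mapSAT f = go
  where
    go []     = []
    go (x:xs) = f x : go xs

main :: IO ()
main = do
  print (mapRec (+1) [1,2,3 :: Int])  -- [2,3,4]
  print (mapSAT (+1) [1,2,3 :: Int])  -- [2,3,4]
```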
---
> what happens if you have sharing
>
>     let ys :: Stream Int = h x in f ys + g ys
This is a flaw in the way that GHC says "lists are streams". IMO a stream (such as in streamly) should happily duplicate all the work that goes into computing the elements of `ys`, since it is just code. By contrast, data structures such as lists do not have that luxury. `let ys :: [Int] = ...; in f ys ++ g ys` should not try to fuse `ys`. Here's a repo that implements just foldr/build fusion without rewriting lists to/from it: https://github.com/sgraf812/foldr-build. It doesn't have the "fusion duplicates work" flaw because there is no data that should have shared the work in the first place.
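A small demonstration of why sharing matters for data (toy producer `h` of my own; `Debug.Trace.trace` just shows how often the per-element work actually runs):

```haskell
import Debug.Trace (trace)

-- Toy producer: each element traces when its work is actually done.
h :: Int -> [Int]
h x = map (\i -> trace "work!" (i * x)) [1, 2, 3]

main :: IO ()
main = do
  let ys = h 2                 -- a real list: each element is computed once
  print (sum ys + product ys)  -- both consumers reuse the shared elements
```

With a shared list, "work!" appears three times; if `ys` were fused separately into `sum` and `product`, the element work would run twice per element.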
---
*TL;DR:* inlining and specialising `Step` sounds like exactly what is needed. Specialisation of concatMap is the hard part for stream fusion, though, and that means we'll end up with `Step` in the program nonetheless.
Btw., Lean's standard `Iterator` framework https://github.com/leanprover/lean4/blob/6e711bf067a0d9dc6623996b4bd3757ac6c... (inspired by Rust's https://doc.rust-lang.org/std/iter/trait.Iterator.html) will implement "stream fusion". (That's what external iteration is.) I'm not saying that it's simple or that the optimiser is equipped to handle it, though :)
Cheers, Sebastian
On Tue, 9 Dec 2025 at 13:45, Simon Peyton Jones <simon.peytonjones@gmail.com> wrote:
> I think a solution based on multi-stage programming is the most natural way to tackle the stream fusion problem in a robust manner.

How close are we to being able to do that in Haskell?

Simon
On Tue, 9 Dec 2025 at 09:56, Matthew Pickering <matthewtpickering@gmail.com> wrote:
I think a solution based on multi-stage programming is the most natural way to tackle the stream fusion problem in a robust manner.
For example: https://andraskovacs.github.io/pdfs/2ltt_icfp24.pdf
By allowing a programmer to express which parts of the program should exist only at code-generation time, you guarantee that fusion occurs. This seems much more aligned with how Haskell programmers think about other things.
Cheers,
Matt
On Tue, Dec 9, 2025 at 12:03 AM Harendra Kumar <harendra.kumar@gmail.com> wrote:

On Mon, 8 Dec 2025 at 23:35, Jaro Reinders wrote:

> I suggested we could integrate this into GHC by giving an inlining discount to any function which contains a function that is also mentioned in a rewrite rule.
There are two distinct fusion mechanisms under consideration here: foldr/build fusion and stream fusion. Rewrite rules allow us to replace a combination of functions with an equivalent function, forming the basis of foldr/build fusion by fusing or eliminating intermediate functions. Stream fusion, on the other hand, eliminates intermediate constructors; the key optimizations for achieving this are case-of-case and worker-wrapper. Both types of fusion require inlining, but they are clearly different at the implementation level, even though the same high-level ideas may apply in both cases.
The fusion-plugin deals with stream fusion only -- inlining and spec-constr are used to expose constructors, which are then eliminated by case-of-case (for linear pipelines) or worker-wrapper (for recursive loops) transformations. fusion-plugin has been used extensively in streamly, and we have been able to fuse almost all the cases we wanted to. I would recommend that you try out fusion-plugin on the vector package; it can help fuse the pesky cases, and you can also gain some insight into how much code bloat it actually causes in practice. Vector is primarily based on stream fusion, and even though it uses rewrite rules, those are not for fusion as such, I believe.
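To make the shape of the problem concrete, here is a minimal stream-fusion skeleton (simplified, in the spirit of vector/streamly internals but not their actual code). Once `smap`, `sfoldl`, and the producer are all inlined, case-of-case can cancel each `Yield`/`Skip`/`Stop` allocation against the case that scrutinises it, leaving one tight constructor-free loop:

```haskell
{-# LANGUAGE ExistentialQuantification #-}
-- Minimal stream-fusion skeleton; simplified, not vector/streamly's code.

data Step s a = Yield a s | Skip s | Stop

-- A stream is a step function plus a hidden state.
data Stream a = forall s. Stream (s -> Step s a) s

smap :: (a -> b) -> Stream a -> Stream b
smap f (Stream step s0) = Stream step' s0
  where
    step' s = case step s of       -- scrutinises a Step...
      Yield x s' -> Yield (f x) s' -- ...and constructs a new one;
      Skip  s'   -> Skip s'        -- case-of-case cancels the pair
      Stop       -> Stop

sfoldl :: (b -> a -> b) -> b -> Stream a -> b
sfoldl g z (Stream step s0) = go z s0
  where
    go acc s = case step s of
      Yield x s' -> go (g acc x) s'
      Skip  s'   -> go acc s'
      Stop       -> acc

senum :: Int -> Int -> Stream Int
senum lo hi = Stream step lo
  where step i = if i > hi then Stop else Yield i (i + 1)

main :: IO ()
main = print (sfoldl (+) 0 (smap (* 2) (senum 1 4)))  -- 20
```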
> However, I also noted I think this will have a large impact on code size and compilation time in larger code bases. Simon agreed, but noted that Harendra's plugin is much more selective than what I suggested.
Typically we have a pipeline of functions connected by fusible constructors. To fuse the entire pipeline completely we need to inline all the functions involved in the pipeline. Thus, instead of multiple small functions, we end up with one large combined function. However, when we apply the case-of-case transformation to the combined function, its size often shrinks dramatically. In fact, the total size of the end product is usually smaller than the combined original components of the pipeline. However, GHC requires more memory and CPU to simplify it, especially if the loop becomes too large.
In our experience, aggressive inlining for fusing linear segments of pipelines, even pretty big ones, has almost never been a problem. If a segment becomes too large we can always break the pipeline with a NOINLINE, trading runtime performance for compile-time resources, but we have rarely needed to do that. The code bloat comes more often from the spec-constr optimization when we are fusing recursive loops. In some cases we have used -fspec-constr-recursive=16, which I admit is a very large value; compile times can sometimes increase dramatically because of it. We can always choose a lower value, which may leave some cases unfused. There may be scope for improvement in GHC in making this optimization smarter and more selective.
We can also put some hard thresholds or heuristics here for the cases when the resulting monolithic functions become too large. One possible way to decide whether to inline or not could be to check if the size of the combined/optimized function is smaller than or roughly equivalent to the sum of the individual parts. If inlining does not result in elimination of the constructors in subsequent passes, then the inlining is not useful. But we have seen that it is almost always possible to eliminate all or most of the constructors.
> Translating this to the fold/build fusion system in GHC today, we would force inlining of any function that calls `foldr` or `build` (or `augment`). Essentially, we would want to inline any binding that refers to a function that partakes in the rewrite rules for fusion. I'll admit we probably don't have to generalise this to all rewrite rules, but even with this restriction it seems like a rather aggressive optimisation strategy to me.
I do not have much experience with foldr/build. But conceptually it is quite similar to what we do in stream fusion, and similar ideas should apply here as well. In this case we inline to expose the foldr/build primitives and then use rewrite rules to combine them. If inlining does not result in further elimination or firing of rewrite rules, then the inlining is not useful. You could try automatic inlining the same way we do in fusion-plugin, see whether code bloat or compilation times become a real issue and in which particular cases, and then add heuristics/thresholds to avoid that. Is there a driving use case for extending this to foldr/build?
-harendra

_______________________________________________
ghc-devs mailing list -- ghc-devs@haskell.org
To unsubscribe send an email to ghc-devs-leave@haskell.org
Hello,

I am sure folks are aware of this, but since it has not been mentioned: another example of useful fusion is `to . from` from `Generics`, where we'd like the representation types to never exist at runtime. I think conceptually the problem is similar, but it's a bit interesting in that it has a different flavor to streams, and also there's a type family involved.
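Concretely (a toy type of my own; `Rep` is the type family involved, and ideally the round-trip compiles to the identity with no `M1`/`K1`/`:*:` values at runtime):

```haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics (Generic, from, to)

-- Toy type; its generic representation uses M1/K1/(:*:) under the hood.
data Point = Point Int Int
  deriving (Show, Eq, Generic)

-- We'd like `to . from` to optimise to `id`, so the representation
-- types never exist at runtime.
roundTrip :: Point -> Point
roundTrip = to . from

main :: IO ()
main = print (roundTrip (Point 1 2))  -- Point 1 2
```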
Cheers,
Iavor
When it comes to fusion of `Generics`, we wrote a paper about performing that fusion in a multi-stage language: https://mpickering.github.io/papers/staged-sop.pdf

In general there are soundness issues with Typed Template Haskell which make programming like this brittle in practice, which is what I discussed in my PhD thesis.

Matt
Hello!

I'm curious how the approach based on metaprogramming is better than a compiler optimization combined with inspection-testing [1]. If someone has opinions on the matter, I think they would be interesting to read.

Cheers!
Facundo

[1] https://hackage.haskell.org/package/inspection-testing
When it comes to fusion of `Generics`, we wrote a paper about performing that fusion in a multi-stage language: https://mpickering.github.io/papers/staged-sop.pdf
In general there are soundness issues with Typed Template Haskell which make programming like this brittle in practice, which is what I discussed in my PhD thesis.
Matt
On Tue, Dec 9, 2025 at 4:35 PM Iavor Diatchki wrote:

Hello,

I am sure folks are aware of this, but since it has not been mentioned: another example of useful fusion is `to . from` from `Generics`, where we'd like the representation types to never exist at runtime. I think conceptually the problem is similar, but it's a bit interesting in that it has a different flavor to streams, and also there's a type family involved.

Cheers, Iavor

On Tue, Dec 9, 2025 at 5:27 AM Sebastian Graf wrote:

(Perhaps I should have said "returning `Step`" instead of "involving `Step`".)

Am Di., 9. Dez. 2025 um 14:26 Uhr schrieb Sebastian Graf <sgraf1337@gmail.com>:

I agree that multi-stage programming is the principled green-field solution. It provides the right model in which to think about solutions for GHC.

For GHC, that model certainly means "Step is just an intermediate data structure", i.e., any computation involving `Step` should only live at compile time ("template polymorphism") and should be inlined/specialised aggressively.

This is not too different from the situation with list fusion. Any compiled program that *runs* `map`/`filter`/`foldr`/`build` has a performance bug. You always want to end up with `foldr`'s SAT'd unfolding/explicit `(:)`/`[]` instead of a higher-order function `build` which does the same thing. That is achieved by the intricate multi-phase rewrite rule framework.

If all goes well, the final program never contains any lists whatsoever, only a nesting of tight recursive loops. Note that there is no code sharing of `map`/`filter`/`foldr` here; they must all have been inlined and rewritten and were purely a compile-time thing.

We want the same for stream fusion because it has better fusion support for `zipWith`. Inlining any function involving `Step` is absolutely the right thing to do! Aggressive specialisation as well, but that's really difficult to do using SpecConstr because it blows up so easily. The stream fusion paper instead suggests applying the static argument transformation in section 7.2. A draft patch of mine tried to do that on the fly to compute unfoldings for recursive functions with static args: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4553. It lacks polishing and my time and interest, but I firmly believe this should help the stream fusion use case, because SAT+inlining is *exactly* specialisation of recursive functions for static arguments.

(Side note: András pointed out to me two years ago that there were further complications which I don't recall and which mean that he settled on the practical compromise of going for foldr/build fusion and enabling zip fusion/parallel for with stream fusion.)

---

> what happens if you have sharing
>
>   let ys :: Stream Int = h x in f ys + g ys

This is a flaw in the way that GHC says "lists are streams". IMO a stream (such as in streamly) should happily duplicate all the work that goes into computing the elements of `ys`, since it is just code. By contrast, data structures such as lists do not have that luxury. `let ys :: [Int] = ...; in f ys ++ g ys` should not try to fuse `ys`. Here's a repo that implements just foldr/build fusion without rewriting lists to/from it: https://github.com/sgraf812/foldr-build. It doesn't have the "fusion duplicates work" flaw because there is no data structure that should have shared the work in the first place.

---

TL;DR: inlining and specialising `Step` sounds like exactly what is needed. Specialisation of concatMap is the hard part for stream fusion, though, and that means we'll end up with `Step` in the program nonetheless.

Btw., Lean's standard `Iterator` framework (inspired by Rust's) will implement "stream fusion". (That's what external iteration is.) I'm not saying that it's simple or that the optimiser is equipped to handle it, though :)

Cheers, Sebastian

Am Di., 9. Dez. 2025 um 13:45 Uhr schrieb Simon Peyton Jones <simon.peytonjones@gmail.com>:

> I think a solution based on multi-stage programming is the most natural
> way to tackle the stream fusion problem in a robust manner.

How close are we to being able to do that in Haskell?

Simon

On Tue, 9 Dec 2025 at 09:56, Matthew Pickering <matthewtpickering@gmail.com> wrote:

I think a solution based on multi-stage programming is the most natural way to tackle the stream fusion problem in a robust manner.

For example: https://andraskovacs.github.io/pdfs/2ltt_icfp24.pdf

By allowing a programmer to express which parts of the program should only exist at code generation time, you guarantee that fusion occurs. This seems much more aligned with how Haskell programmers think about other things.

Cheers,
Matt

On Tue, Dec 9, 2025 at 12:03 AM Harendra Kumar <harendra.kumar@gmail.com> wrote:

> On Mon, 8 Dec 2025 at 23:35, Jaro Reinders <jaro.reinders@gmail.com> wrote:
> >
> > I suggested we could integrate this into GHC by giving an inlining discount to
> > any function which contains a function that is also mentioned in a rewrite rule.
>
> There are two distinct fusion mechanisms under consideration here --
> foldr/build fusion and stream fusion. Rewrite rules allow us to
> replace a combination of functions with an equivalent function,
> forming the basis of foldr/build fusion by fusing or eliminating
> intermediate functions. Stream fusion on the other hand eliminates
> intermediate constructors; the key optimizations for achieving this
> are case-of-case and worker-wrapper. Both types of fusion require
> inlining, but they are clearly different at the implementation level,
> even though the same high-level ideas may apply in both cases.
>
> The fusion-plugin deals with stream fusion only -- where inlining and
> SpecConstr are used to expose constructors which are then eliminated
> by case-of-case (for linear pipelines) or worker-wrapper (for
> recursive loops) transformations. fusion-plugin has been used
> extensively in streamly and we have been able to fuse almost all cases
> that we wanted to. I would recommend that you try out fusion-plugin on
> the vector package; it can help fuse the pesky cases and you can also
> gain some insight into how much code bloat it actually causes in
> practice. Vector is primarily based on stream fusion, and even though
> it uses rewrite rules, those are not for fusion as such, I believe.
>
> > However, I also noted I think this will have a large impact on code size and
> > compilation time in larger code bases. Simon agreed, but noted that Harendra's
> > plugin is much more selective than what I suggested.
>
> Typically we have a pipeline of functions connected by fusible
> constructors. To fuse the entire pipeline completely we need to inline
> all the functions involved in the pipeline. Thus instead of having
> multiple small functions we end up creating one large combined
> function. However, when we use the case-of-case transformation on the
> combined function, the size often gets reduced dramatically. In fact
> the total size of the end product is usually smaller than the combined
> original components in the pipeline. However, GHC requires more memory
> and CPU to simplify, especially if the loop becomes too large.
>
> In our experience, aggressive inlining for fusing linear segments of
> pipelines, even if they are pretty big, has almost never been a problem.
> If it becomes large we can always break the pipeline by using a
> NOINLINE, making a trade-off between runtime perf and compile-time
> resource requirements, but we rarely needed to do that. The code bloat
> comes more often from the SpecConstr optimization when we are fusing
> recursive loops. In some cases we have used
> -fspec-constr-recursive=16, which I admit is a very large value; the
> compile times can sometimes increase dramatically due to this. We can
> always choose to use a lower value, which may result in unfused code in
> some cases. There may be scope for improvement in GHC for making this
> optimization smarter and more selective.
>
> We can also put some hard thresholds or heuristics here for the cases
> when the resulting monolithic functions become too large. One possible
> way to decide whether to inline or not could be to check if the size
> of the combined/optimized function is smaller than or roughly
> equivalent to the sum of the individual parts. If inlining does not
> result in elimination of the constructors in subsequent passes, then
> the inlining is not useful. But we have seen that it is almost always
> possible to eliminate all or most of the constructors.
>
> > Translating this to the foldr/build fusion system in GHC today, we would force
> > inlining of any function that calls `foldr` or `build` (or `augment`).
> > Essentially, we would want to inline any binding that refers to a function that
> > partakes in the rewrite rules for fusion. I'll admit we probably don't have to
> > generalise this to all rewrite rules, but even with this restriction it seems
> > like a rather aggressive optimisation strategy to me.
>
> I do not have much experience with using foldr/build. But conceptually
> this is quite similar to what we do in stream fusion, and similar ideas
> should be applicable here as well. In this case we inline to expose
> foldr/build primitives and then use rewrite rules to combine those. If
> the inlining does not result in further elimination or firing of
> rewrite rules then the inlining is not useful. You can probably try
> out automatic inlining in the same way as we do in fusion-plugin and
> see if code-bloat/compilation times become a real issue, and in which
> particular cases, then put heuristics/thresholds to avoid that. Is
> there a driving use case for extending this to foldr/build?
>
> -harendra
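[Editor's note: since the thread keeps contrasting the two mechanisms, here is a minimal self-contained sketch of the foldr/build encoding for readers following along. The names `build'` and `mapB` are made up to avoid clashing with `GHC.Exts`; the real rules in base are additionally phase-controlled.]

```haskell
{-# LANGUAGE RankNTypes #-}

-- A producer expressed as its own fold: given cons and nil, it builds
-- the list.  GHC.Exts.build has exactly this shape.
build' :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build' g = g (:) []
{-# INLINE [1] build' #-}

-- map as both a good consumer (via foldr) and a good producer (via
-- build'), mirroring how map is set up for fusion in base.
mapB :: (a -> b) -> [a] -> [b]
mapB f xs = build' (\cons nil -> foldr (cons . f) nil xs)
{-# INLINE mapB #-}

-- The fusion rule: a foldr consuming a build' cancels the intermediate
-- list.  For it to fire, mapB (and every other combinator in the
-- pipeline) must first be inlined -- which is exactly the inlining
-- problem discussed in this thread.
{-# RULES "foldr/build'" forall k z g. foldr k z (build' g) = g k z #-}
```

With `-O`, `foldr (+) 0 (mapB (*2) xs)` rewrites into a single traversal of `xs`; if `mapB` is too big to inline, the rule never sees a `build'` application and fusion silently fails without changing the program's meaning.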
On Tue, 9 Dec 2025 at 18:56, Sebastian Graf wrote:
> We want the same for stream fusion because it has better fusion support
> for `zipWith`. Inlining any function involving `Step` is absolutely the
> right thing to do! Aggressive specialization as well, but that's really
> difficult to do using SpecConstr because it blows up so easily.
Yes, exactly. In the vector stream type, the state of the stream is exposed, so we can compose the states and step functions of multiple streams to create a combined, fused step. This enables producer-producer fusion: e.g. zipWith or mergeBy can be fused. foldr/build only supports producer-consumer fusion and cannot support producer-producer fusion because the producer does not expose its internal state.

Yes, SpecConstr can blow up, especially when the state type is too big or is recursively defined. In practice the problematic case we have seen is when the state contains a list, because lists have unbounded structure and SpecConstr can try to specialize for many argument patterns when we are using something like -fspec-constr-recursive=16. We disabled SpecConstr in that situation to avoid code size explosion.
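[Editor's note: to make the producer-producer point concrete, here is a minimal pure sketch of such an exposed-state stream. The names `senum`, `szip`, `sfoldl` are illustrative; vector's and streamly's real types are monadic and carry size hints. Note how `szip`'s state is just the pair of the two producers' states plus a one-element buffer.]

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Yield a s | Skip s | Stop

-- A stream is a step function together with its (exposed) state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Enumerate lo..hi; the state is just the current counter.
senum :: Int -> Int -> Stream Int
senum lo hi = Stream step lo
  where step i | i > hi    = Stop
               | otherwise = Yield i (i + 1)

-- Producer-producer fusion: the zipped stream's state composes both
-- producers' states (plus a buffer for the left element) -- something
-- foldr/build cannot express, because its producers hide their state.
szip :: (a -> b -> c) -> Stream a -> Stream b -> Stream c
szip f (Stream stepa sa0) (Stream stepb sb0) = Stream step (sa0, sb0, Nothing)
  where
    step (sa, sb, Nothing) = case stepa sa of
      Yield a sa' -> Skip (sa', sb, Just a)
      Skip sa'    -> Skip (sa', sb, Nothing)
      Stop        -> Stop
    step (sa, sb, Just a) = case stepb sb of
      Yield b sb' -> Yield (f a b) (sa, sb', Nothing)
      Skip sb'    -> Skip (sa, sb', Just a)
      Stop        -> Stop

-- A consumer: tail-recursive left fold over the stream.
sfoldl :: (b -> a -> b) -> b -> Stream a -> b
sfoldl f z (Stream step s0) = go z s0
  where go acc s = case step s of
          Yield a s' -> go (f acc a) s'
          Skip s'    -> go acc s'
          Stop       -> acc
```

Once `senum`, `szip` and `sfoldl` are inlined, case-of-case can expose and cancel the `Step` constructors, leaving one tight loop over the composed state.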
> TLDR; inlining and specialising `Step` sounds like exactly what is
> needed. Specialisation of concatMap is the hard part for stream fusion,
> though, and that means we'll end up with `Step` in the program
> nonetheless.
We do not try to fuse concatMap; instead we recommend another fusible abstraction for loop nesting, the Unfold type in streamly, and related operations like unfoldEach. That is the right way to achieve nested loop fusion; with it we get fully fused nested stream code. It is not as powerful as concatMap, but in most common cases we do not need that additional power. If you really need the full power of concatMap, you cannot expect the same performance, because fully general concatMap fusion is not achievable without losing fusion or causing excessive specialization. -harendra
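[Editor's note: a minimal pure rendering of that idea; streamly's real `Unfold` is monadic and the helper names here are made up. The crucial point is that an `Unfold`'s state type is fixed once and for all, whereas `concatMap`'s argument `a -> Stream b` may hide a different existential state type for every element, which is what defeats specialisation.]

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- The same minimal stream definitions as in the sketch above.
data Step s a = Yield a s | Skip s | Stop
data Stream a = forall s. Stream (s -> Step s a) s

senum :: Int -> Int -> Stream Int
senum lo hi = Stream step lo
  where step i | i > hi    = Stop
               | otherwise = Yield i (i + 1)

sfoldl :: (b -> a -> b) -> b -> Stream a -> b
sfoldl f z (Stream step s0) = go z s0
  where go acc s = case step s of
          Yield a s' -> go (f acc a) s'
          Skip s'    -> go acc s'
          Stop       -> acc

-- An Unfold is one step function plus one seed-injection function; its
-- state type s is chosen once, not once per element.
data Unfold a b = forall s. Unfold (s -> Step s b) (a -> s)

-- Illustrative inner loop: count from the seed up to a fixed bound.
countTo :: Int -> Unfold Int Int
countTo hi = Unfold step id
  where step i | i > hi    = Stop
               | otherwise = Yield i (i + 1)

-- Nested-loop fusion: the combined state is (outer state, optional
-- inner state) -- a single, statically known type the simplifier can
-- specialise for, unlike the per-element streams of concatMap.
unfoldEach :: Unfold a b -> Stream a -> Stream b
unfoldEach (Unfold istep inject) (Stream ostep os0) =
  Stream step (os0, Nothing)
  where
    step (os, Nothing) = case ostep os of
      Yield a os' -> Skip (os', Just (inject a))
      Skip os'    -> Skip (os', Nothing)
      Stop        -> Stop
    step (os, Just is) = case istep is of
      Yield b is' -> Yield b (os, Just is')
      Skip is'    -> Skip (os, Just is')
      Stop        -> Skip (os, Nothing)
```

Here `unfoldEach (countTo 3) (senum 1 3)` enumerates 1..3, 2..3 and 3..3 in one fused loop whose state never leaves the pair type.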
On Tue, 9 Dec 2025 at 15:23, Matthew Pickering wrote:
> I think a solution based on multi-stage programming is the most natural
> way to tackle the stream fusion problem in a robust manner.
In theory, sure, but in practice multi-stage programming is not very ergonomic for most developers. It introduces yet another unfamiliar paradigm, which is a tough sell when Haskell is already niche. I would much rather improve the existing programming model with minimal disruption.

Stream fusion already works remarkably well, and GHC is excellent at the required core transformations; it has all the necessary infrastructure. If we address the remaining corner cases and bring these improvements into GHC itself, we can get C-like performance out of the box. fusion-plugin and streamly have shown that the required fixes are small, the results are robust, and most real-world cases fuse cleanly, including streams, folds, parsers, and combinations of these. GHC's ability to generate highly efficient code is underrated; I would love to see these enhancements become part of stock GHC. This is already proven, implemented, and used in practice with hundreds of benchmarks, so it can be done with a bit more effort. I would be more than happy to help if anyone wants to take this up.

At the same time, we can look into making multi-stage programming more popular, but that would be a tough job, I guess.

-harendra
participants (9)
- Andreas Klebinger
- Facundo Domínguez
- Georgi Lyubenov
- Harendra Kumar
- Iavor Diatchki
- Jaro Reinders
- Matthew Pickering
- Sebastian Graf
- Simon Peyton Jones