335654e0
by Rodrigo Mesquita at 2025-07-29T19:14:40+01:00
bytecode: Don't PUSH_L 0; SLIDE 1 1
While looking through bytecode I noticed a fairly common and unfortunate
pattern:
```
...
PUSH_L 0
SLIDE 1 1
```
We emit this often because we generically construct a tail call from a
function atom that may sit at an arbitrary position on the stack.
However, in the special case where the function is already directly on
top of the stack, as part of the arguments, pushing and then sliding it
is plainly redundant.
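To see why it is redundant, here is a tiny model of the two instructions
(illustrative Haskell only, not GHC's actual types or interpreter): the stack
is a list with its top at the head, `PUSH_L off` copies the word at offset
`off` onto the top, and `SLIDE n d` keeps the top `n` words while dropping the
`d` words underneath them.
```haskell
-- Toy model only; names and representations are illustrative, not GHC's.
data BCInstr
  = PUSH_L Int     -- push a copy of the word at this offset from the top
  | SLIDE Int Int  -- SLIDE n d: keep the top n words, drop the d words below
  deriving Show

-- Run a sequence of instructions on a stack modelled as a list (top first).
run :: [BCInstr] -> [a] -> [a]
run []                stack = stack
run (PUSH_L off : is) stack = run is (stack !! off : stack)
run (SLIDE n d  : is) stack =
  let (kept, rest) = splitAt n stack
  in  run is (kept ++ drop d rest)

-- With the function atom 'f' already on top of the stack:
--   run [PUSH_L 0, SLIDE 1 1] "fxy"  ==  "fxy"                         -- a no-op
--   run [PUSH_L 0, SLIDE 1 2] "fxy"  ==  run [SLIDE 1 1] "fxy" == "fy" -- smaller slide
```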
In this commit we add a small optimisation to the generation of tail
calls in bytecode. Simply: look ahead for the function on the stack. If
it is the first thing on the stack, and it is among the arguments that
would be dropped as we enter the tail call, then don't push-and-slide
it.
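A minimal sketch of that lookahead, reusing the toy `BCInstr` type from the
sketch above; `enterTailCall`, `funOff` and `wordsToDrop` are made-up names,
and the real change lives in GHC's bytecode generator, which also accounts for
all the arguments it pushes rather than just this single-word shape:
```haskell
-- Sketch only: emit the push/slide part of a tail call (the instruction that
-- actually enters the call afterwards is omitted here).
enterTailCall
  :: Int         -- ^ funOff: offset of the function atom from the stack top
  -> Int         -- ^ wordsToDrop: words the SLIDE squeezes out under the call
  -> [BCInstr]
enterTailCall funOff wordsToDrop
  -- Lookahead special case: the function is already the top word and is one
  -- of the words the SLIDE would drop, so push-then-slide is redundant.
  | funOff == 0, wordsToDrop >= 1
  = [SLIDE 1 (wordsToDrop - 1) | wordsToDrop > 1]  -- SLIDE 1 0 would be a no-op
  -- General case: copy the function to the top, then squeeze out stale words.
  | otherwise
  = [PUSH_L funOff, SLIDE 1 wordsToDrop]
```
With the toy interpreter above, `run (enterTailCall 0 2) "fxy"` and
`run [PUSH_L 0, SLIDE 1 2] "fxy"` both evaluate to `"fy"`, so the shorter
sequence is observably equivalent.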
In a simple example (T26042b), this already produced a drastic
improvement in the generated code (`<` lines are the old output, `>`
lines are with this patch):
```diff
3c3
< 2025-07-29 10:14:02.081277 UTC
---
> 2025-07-29 10:50:36.560949 UTC
160,161c160
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
164,165d162
< PUSH_L 0
< SLIDE 1 1
175,176c172
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
179,180d174
< PUSH_L 0
< SLIDE 1 1
206,207d199
< PUSH_L 0
< SLIDE 1 1
210,211d201
< PUSH_L 0
< SLIDE 1 1
214,215d203
< PUSH_L 0
< SLIDE 1 1
218,219d205
< PUSH_L 0
< SLIDE 1 1
222,223d207
< PUSH_L 0
< SLIDE 1 1
333,334c317
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
337,338d319
< PUSH_L 0
< SLIDE 1 1
367,368c348
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
371,372c351
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
375,376d353
< PUSH_L 0
< SLIDE 1 1
379,380c356
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
383,384c359
< PUSH_L 0
< SLIDE 1 3
---
> SLIDE 1 2
387,388c362
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
417,418d390
< PUSH_L 0
< SLIDE 1 1
421,422c393
< PUSH_L 0
< SLIDE 1 4
---
> SLIDE 1 3
442,443c413
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
446,447c416
< PUSH_L 0
< SLIDE 1 4
---
> SLIDE 1 3
510,511c479
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
514,515d481
< PUSH_L 0
< SLIDE 1 1
600,601c566
< PUSH_L 0
< SLIDE 1 2
---
> SLIDE 1 1
604,605d568
< PUSH_L 0
< SLIDE 1 1
632,633d594
< PUSH_L 0
< SLIDE 1 1
636,637d596
< PUSH_L 0
< SLIDE 1 1
640,641d598
< PUSH_L 0
< SLIDE 1 1
644,645d600
< PUSH_L 0
< SLIDE 1 1
648,649d602
< PUSH_L 0
< SLIDE 1 1
652,653d604
< PUSH_L 0
< SLIDE 1 1
656,657d606
< PUSH_L 0
< SLIDE 1 1
660,661d608
< PUSH_L 0
< SLIDE 1 1
664,665d610
< PUSH_L 0
< SLIDE 1 1
```
I also compiled lib:Cabal to bytecode and counted the number of bytecode
lines with `find dist-newstyle -name "*.dump-BCOs" -exec wc {} +`:
With unoptimized core:
  1190689 lines (before) - 1172891 lines (now)
  = 17798 fewer redundant instructions (-1.5%)
With optimized core:
  1924818 lines (before) - 1864836 lines (now)
  = 59982 fewer redundant instructions (-3.1%)