panic when compiling SHA

Hi,

When I tried to build the SHA library with GHC HEAD on 32-bit Linux, GHC HEAD panicked. GHC 7.4.2 can build SHA on the same machine.

Configuring SHA-1.6.1...
Building SHA-1.6.1...
Failed to install SHA-1.6.1
Last 10 lines of the build log ( /home/kazu/work/rpf/.cabal-sandbox/logs/SHA-1.6.1.log ):
Preprocessing library SHA-1.6.1...
[1 of 1] Compiling Data.Digest.Pure.SHA ( Data/Digest/Pure/SHA.hs, dist/dist-sandbox-ef3aaa11/build/Data/Digest/Pure/SHA.o )
ghc: panic! (the 'impossible' happened)
  (GHC version 7.7.20131202 for i386-unknown-linux):
        regSpill: out of spill slots!
        regs to spill = 1129
        slots left    = 677
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug

--Kazu

On Thu, Dec 26, 2013 at 8:37 PM, Kazu Yamamoto
When I tried to build the SHA library with GHC HEAD on 32-bit Linux, GHC HEAD panicked. GHC 7.4.2 can build SHA on the same machine.
[1 of 1] Compiling Data.Digest.Pure.SHA ( Data/Digest/Pure/SHA.hs, dist/dist-sandbox-ef3aaa11/build/Data/Digest/Pure/SHA.o )
ghc: panic! (the 'impossible' happened)
  (GHC version 7.7.20131202 for i386-unknown-linux):
        regSpill: out of spill slots!
        regs to spill = 1129
        slots left    = 677
I ran into this a while ago. It turns out it's a known bug [0], but it seems like it's been hard to reproduce. FWIW, I also hit it on 32-bit Linux. You can probably work around it with `-fno-regs-graph`.

[0]: https://ghc.haskell.org/trac/ghc/ticket/5361
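A sketch of how one might apply that workaround per-module, assuming the module is otherwise built with -fregs-graph (the flag can equally go in the package's ghc-options):

    -- Hedged sketch: override -fregs-graph for this module, falling back
    -- to GHC's default (linear) register allocator.
    {-# OPTIONS_GHC -fno-regs-graph #-}
    module Data.Digest.Pure.SHA where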

On 27/12/2013, at 12:07 PM, Kazu Yamamoto (山本和彦) wrote:
Hi,
When I tried to build the SHA library with GHC HEAD on 32-bit Linux, GHC HEAD panicked. GHC 7.4.2 can build SHA on the same machine.
ghc: panic! (the 'impossible' happened): regSpill: out of spill slots! regs to spill = 1129, slots left = 677
There are only a fixed number of register spill slots, and when they're all used the compiler can't dynamically allocate more of them.

This SHA benchmark is pathological in that the intermediate code expands to have many variables with long, overlapping live ranges. The underlying problem is really that the inliner and/or other optimisations have gone crazy and made a huge intermediate program. We *could* give it more spill slots to make it compile, but the generated code would be horrible.

Try turning down the optimisation level, reducing the inliner's keenness, or backing off the SpecConstr flags.

Ben.
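For concreteness, one sketch of that sort of flag tuning via a per-module pragma; the threshold value is purely illustrative, not a recommendation:

    -- Illustrative sketch: back off to -O1, make the inliner less keen,
    -- and disable call-pattern specialisation for this one module.
    {-# OPTIONS_GHC -O1 -funfolding-use-threshold=30 -fno-spec-constr #-}
    module Data.Digest.Pure.SHA where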

On 28/12/13 03:58, Ben Lippmeier wrote:
On 27/12/2013, at 12:07 PM, Kazu Yamamoto (山本和彦) wrote:
Hi,
When I tried to build the SHA library with GHC HEAD on 32-bit Linux, GHC HEAD panicked. GHC 7.4.2 can build SHA on the same machine.
ghc: panic! (the 'impossible' happened): regSpill: out of spill slots! regs to spill = 1129, slots left = 677
There are only a fixed number of register spill slots, and when they're all used the compiler can't dynamically allocate more of them.
Not true any more in 7.8+ with the linear allocator. I think it might still be true for the graph allocator, which is sadly suffering from a little bitrot and probably doesn't generate very good code with the new code generator. So, avoiding -fregs-graph should work around this with 7.8. Cheers, Simon

Hi,
There are only a fixed number of register spill slots, and when they're all used the compiler can't dynamically allocate more of them.
Not true any more in 7.8+ with the linear allocator. I think it might still be true for the graph allocator, which is sadly suffering from a little bitrot and probably doesn't generate very good code with the new code generator.
So, avoiding -fregs-graph should work around this with 7.8.
I confirmed that removing -fregs-graph works around this with 7.8.

--Kazu

On 04/01/2014, at 23:22 , Kazu Yamamoto (山本和彦)
Hi,
So, avoiding -fregs-graph should work around this with 7.8.
I confirmed that removing -fregs-graph works around this with 7.8.
Ok, my mistake. We originally added -fregs-graph when compiling that module because both allocators had a fixed stack size, but the graph allocator did a better job of allocation and avoided overflowing the stack. Note that removing the flag isn't a "solution" to the underlying problem of the intermediate code being awful. Switching to the linear allocator just permits compilation of core code that was worse than before. Now it needs to spill more registers when compiling the same source code. Ben.

Ben,
Note that removing the flag isn't a "solution" to the underlying problem of the intermediate code being awful. Switching to the linear allocator just permits compilation of core code that was worse than before. Now it needs to spill more registers when compiling the same source code.
So, would you reopen #5361 yourself? https://ghc.haskell.org/trac/ghc/ticket/5361

--Kazu

On 06/01/2014, at 14:08 , Kazu Yamamoto (山本和彦)
Ben,
Note that removing the flag isn't a "solution" to the underlying problem of the intermediate code being awful. Switching to the linear allocator just permits compilation of core code that was worse than before. Now it needs to spill more registers when compiling the same source code.
So, would you reopen #5361 yourself?
Not if we just have this one test. I'd be keen to blame excessive use of inline pragmas in the SHA library itself, or excessive optimisation flags. It's not really a bug in GHC until there are two tests that exhibit the same problem. Ben.

On Jan 6, 2014, at 12:20 AM, Ben Lippmeier
On 06/01/2014, at 14:08, Kazu Yamamoto (山本和彦) wrote:
Ben,
Note that removing the flag isn't a "solution" to the underlying problem of the intermediate code being awful. Switching to the linear allocator just permits compilation of core code that was worse than before. Now it needs to spill more registers when compiling the same source code.
So, would you reopen #5361 yourself?
Not if we just have this one test. I'd be keen to blame excessive use of inline pragmas in the SHA library itself, or excessive optimisation flags. It's not really a bug in GHC until there are two tests that exhibit the same problem.
The SHA library uses SPECIALIZE, INLINE, and bang patterns in fairly standard ways. There's nothing too exotic in there; I just basically sprinkled hints in places I thought would be useful, and then backed those up with benchmarking.

If GHC simply emitted rotten code in this case, I'd agree: wait for more examples, and put the onus on the developer to make it work better. However, right now, GHC crashes on valid input. Which is a bug. So I'd argue that the ticket should be re-opened. I suppose, alternatively, the documentation on SPECIALIZE, INLINE, and bang patterns could be changed to note that using them is not officially supported.

If the problem is pretty fundamental, then perhaps instead of panicking and dying, GHC should instead default back to a worse register allocator. Perhaps it could print a warning when that happens, but that's optional. That would be an easier way to fix this bug if there are deeper algorithmic problems, or if fixing it for SHA would simply move the failure line a little further down the field. (Obviously this route opens a performance regression on my end, but hey, that's my problem.)

- Adam

On 07/01/2014, at 9:26 , Adam Wick
Not if we just have this one test. I'd be keen to blame excessive use of inline pragmas in the SHA library itself, or excessive optimisation flags. It's not really a bug in GHC until there are two tests that exhibit the same problem.
The SHA library uses SPECIALIZE, INLINE, and bang patterns in fairly standard ways. There’s nothing too exotic in there, I just basically sprinkled hints in places I thought would be useful, and then backed those up with benchmarking.
Ahh. It's the "sprinkled hints in places I thought would be useful" which is what I'm concerned about. If you just add pragmas without understanding their effect on the core program then it'll bite further down the line. Did you compare the object code size as well as wall clock speedup?
If GHC simply emitted rotten code in this case, I’d agree: wait for more examples, and put the onus on the developer to make it work better. However, right now, GHC crashes on valid input. Which is a bug. So I’d argue that the ticket should be re-opened. I suppose, alternatively, the documentation on SPECIALIZE, INLINE, and bang patterns could be changed to note that using them is not officially supported.
Sadly, "valid input" isn't a well defined concept in practice. You could write a "valid" 10GB Haskell source file that obeyed the Haskell standard grammar, but I wouldn't expect that to compile either. You could also write small (< 1k) source programs that trigger complexity problems in Hindley-Milner style type inference. You could also use compile-time meta programming (like Template Haskell) to generate intermediate code that is well formed but much too big to compile. The fact that a program obeys a published grammar is not sufficient to expect it to compile with a particular implementation (sorry to say).
If the problem is pretty fundamental, then perhaps instead of panicking and dying, GHC should instead default back to a worse register allocator. Perhaps it could print a warning when that happens, but that’s optional. That would be an easier way to fix this bug if there are deeper algorithmic problems, or if fixing it for SHA would simply move the failure line a little further down the field. (Obviously this route opens a performance regression on my end, but hey, that’s my problem.)
Adding an INLINE pragma is akin to using compile-time meta programming. I suspect your meta programming is more broken than GHC in this case, but I'd be happy to be proven otherwise. Right now the panic from the register allocator is all the feedback you've got that something is wrong, and the SHA library is the only one I've seen that causes this problem. See above discussion about "valid input". Ben.

GHC crashes on valid input. Which is a bug.

As Ben pointed out, it is conceivable that the compiler will not be able to handle a correct program. But as a user I would expect GHC to detect such situations (if possible) and display an error message, not crash with a panic (which clearly says this is a bug and should be reported).
Janek

On Jan 7, 2014, at 4:11 AM, Jan Stolarek
GHC crashes on valid input. Which is a bug.
As Ben pointed out, it is conceivable that the compiler will not be able to handle a correct program.
Personally, I find this view extremely disappointing. If my SHA library failed to work on a valid input, I would consider that a bug. Why is GHC special? Keep in mind that I’m not saying that this bug needs to be highest priority and fixed immediately, but instead I’m merely arguing that it should be acknowledged as a bug.
But as a user I would expect GHC to detect such situations (if possible) and display an error message, not crash with a panic (which clearly says this is a bug and should be reported).
Personally, I’d find this an acceptable, if a bit disappointing, solution. Essentially you’re redefining "valid input." It just seems a shame to be doing so because of an implementation weakness rather than an actual, fundamental problem. - Adam

On Jan 7, 2014, at 2:27 AM, Ben Lippmeier
On 07/01/2014, at 9:26, Adam Wick wrote:
Not if we just have this one test. I'd be keen to blame excessive use of inline pragmas in the SHA library itself, or excessive optimisation flags. It's not really a bug in GHC until there are two tests that exhibit the same problem.
The SHA library uses SPECIALIZE, INLINE, and bang patterns in fairly standard ways. There’s nothing too exotic in there, I just basically sprinkled hints in places I thought would be useful, and then backed those up with benchmarking.
Ahh. It's the "sprinkled hints in places I thought would be useful" which is what I'm concerned about. If you just add pragmas without understanding their effect on the core program then it'll bite further down the line. Did you compare the object code size as well as wall clock speedup?
I understand the pragmas and what they do with my code. I use SPECIALIZE twice, for two functions. In both functions it was clearer to write the function as (a -> a -> a -> a), but I wanted specialized versions for the two instantiations that were going to be used, in which (a == Word32) or (a == Word64). This benchmarked as faster while maintaining code clarity and concision. I use INLINE in five places, each of them a SHA step function, with the understanding that it would generate ideal code for a compiler for the performance-critical parts of the algorithm: straight-line, single-block code with no conditionals.

When I did my original performance work, several versions of GHC ago, I did indeed consider compile time, runtime performance, and space usage. I picked what I thought was a reasonable balance at the time.

I also just performed an experiment in which I took the SHA library, deleted all instances of INLINE and SPECIALIZE, and compiled it with HEAD on 32-bit Linux. You get the same crash. So my usage of SPECIALIZE and INLINE is beside the point.
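(As a sketch of the shapes involved; the names and bodies here are illustrative stand-ins, not the library's actual definitions:)

    {-# LANGUAGE BangPatterns #-}
    module Sketch where

    import Data.Bits (Bits, (.&.), complement, xor)
    import Data.Word (Word32, Word64)

    -- Illustrative: a step function kept polymorphic for clarity, with
    -- SPECIALIZE pragmas for the two instantiations actually used.
    {-# SPECIALIZE ch :: Word32 -> Word32 -> Word32 -> Word32 #-}
    {-# SPECIALIZE ch :: Word64 -> Word64 -> Word64 -> Word64 #-}
    ch :: Bits a => a -> a -> a -> a
    ch !x !y !z = (x .&. y) `xor` (complement x .&. z)

    -- Illustrative: a round helper marked INLINE so each use compiles to
    -- straight-line code at the call site.
    {-# INLINE chRound #-}
    chRound :: Word32 -> Word32 -> Word32 -> Word32
    chRound !a !b !c = ch a b c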
Sadly, "valid input" isn't a well defined concept in practice. You could write a "valid" 10GB Haskell source file that obeyed the Haskell standard grammar, but I wouldn't expect that to compile either.
I would. I’m a little disappointed that ghc-devs does not. I wouldn’t expect it to compile quickly, but I would expect it to run.
You could also write small (< 1k) source programs that trigger complexity problems in Hindley-Milner style type inference. You could also use compile-time meta programming (like Template Haskell) to generate intermediate code that is well formed but much too big to compile. The fact that a program obeys a published grammar is not sufficient to expect it to compile with a particular implementation (sorry to say).
If I write a broken Template Haskell macro, then yes, I agree. This is not the case in this example.
Adding an INLINE pragma is akin to using compile-time meta programming.
Is it? I find that a strange point of view. Isn’t INLINE just a strong hint to the compiler that this function should be inlined? How is using INLINE any different from simply manually inserting the code at every call site? - Adam

On 08/01/2014, at 10:57 , Adam Wick
I also just performed an experiment in which I took the SHA library, deleted all instances of INLINE and SPECIALIZE, and compiled it with HEAD on 32-bit Linux. You get the same crash. So my usage of SPECIALIZE and INLINE is beside the point.
Ok, then maybe the default inliner heuristics are a bit too eager for this program. Whether that's a bug is open for debate. The standard way of setting such heuristics is to compile a "representative" set of benchmarks (eg, nofib) and choose some settings that give good average performance. I don't think this is an ideal approach, but it's the typical one for compiler engineering.
Sadly, "valid input" isn't a well defined concept in practice. You could write a "valid" 10GB Haskell source file that obeyed the Haskell standard grammar, but I wouldn't expect that to compile either.
I would. I’m a little disappointed that ghc-devs does not. I wouldn’t expect it to compile quickly, but I would expect it to run.
To satisfy such a demand, GHC would need to have linear space usage with respect to the input program size. This implies it must also be linear with respect to the number of top-level declarations, number of variables, number of quantifiers in type sigs, and any other countable thing in the input program. It would also need to be linear for other finite resources that might run out, like symbol table entries. If you had 1Gig top-level foreign exported declarations in the source program, I suspect the ELF linker would freak out.

I'm not trying to be difficult or argumentative -- I think limits like these come naturally with a concrete implementation. I agree it's sad that client programmers can't just enable -O2 and expect every program to work. It'd be nice to have optimisation levels that give resource or complexity guarantees, like "enabling this won't make the code-size non-linear in the input size", but that's not how it works at the moment. I'm not aware of any compiler for a "high level" language that gives such guarantees, but there may be some. I'd be interested to know of any.
Adding an INLINE pragma is akin to using compile-time meta programming.
Is it? I find that a strange point of view. Isn’t INLINE just a strong hint to the compiler that this function should be inlined? How is using INLINE any different from simply manually inserting the code at every call site?
It's not a "hint" -- it *forces* inlining at every call site, like you said. It'll make a new copy of the function body for every call site, and not back out if the program gets "too big". Suppose:

    f x = g x ... g x' ... g x''
    g y = h y ... h y' ... h y''
    h z = i z ... i z' ... i z''

Now force inlining for all of f, g, h, etc. I'd expect to see at least 3*3*3 = 27 copies of the body of 'i' in the core program, and even more if SpecConstr and the LiberateCase transform are turned on. We had (and have) big problems like this with DPH. It took too long for the DPH team to unlearn the dogma that "inlining and call pattern specialisation make the program better".

Ben.
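(Spelled out as a compilable toy, with hypothetical names; the duplication is visible in the simplified core, e.g. with -ddump-simpl:)

    -- Toy illustration of multiplicative code growth under forced
    -- inlining: each layer mentions the next one three times, so the
    -- body of 'i' is duplicated 3*3*3 = 27 times inside 'f'.
    module InlineBlowup where

    {-# INLINE g #-}
    g :: Int -> Int
    g y = h y + h (y + 1) + h (y + 2)

    {-# INLINE h #-}
    h :: Int -> Int
    h z = i z + i (z + 1) + i (z + 2)

    {-# INLINE i #-}
    i :: Int -> Int
    i w = w * w + 1

    f :: Int -> Int
    f x = g x + g (x + 1) + g (x + 2)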

Adam,

I agree that it should be considered a misfeature (or at the very least a good stress test that currently breaks the register allocator). That said, INLINE / INLINEABLE are only needed for intermodule optimization; have you tried using the special "inline" primop selectively, or using INLINEABLE plus selective inline? I think inline should work in the defining module even if you don't provide an INLINE or INLINEABLE.
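(A sketch of what I mean, where 'step' is a hypothetical stand-in for one of the SHA round functions:)

    module SelectiveInline where

    import GHC.Exts (inline)

    -- Sketch: keep the function out-of-line by default, but expose its
    -- unfolding with INLINEABLE so 'inline' can force it at chosen sites.
    {-# INLINEABLE step #-}
    step :: Int -> Int -> Int
    step a b = a * 33 + b

    -- Force inlining at this one hot call site only.
    hotSpot :: Int -> Int
    hotSpot x = inline step x (x + 1)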
Question 1: does the code compile well when you use -fllvm? (It seems like the discussion so far has been NCG-focused.) How does the generated assembly fare that way vs the workaround path on the NCG?

Hello,

I find it a bit perplexing (and not at all constructive) that we are arguing over semantics here. We have a program (1 module, ~1000 lines of "no fancy extension Haskell") which causes GHC to panic. This is a bug. An invariant that we were assuming did not actually hold, hence the message that the "impossible" happened. If GHC decides to refuse to compile a program, it should not panic but, rather, explain what happened and maybe suggest a workaround.

I am not familiar with GHC's back-end, but it seems that there might be something interesting going on here. The SHA library works fine with 7.6.3, and it compiles (admittedly very slowly) using GHC HEAD on my 64-bit machine. So something has changed, and it'd be nice if we understood what's causing the problem.

Ben suggested that the issue might be the INLINE pragmas, but clearly that's not the problem, as Adam reproduced the same behavior without those pragmas. If the issue is indeed with the built-in inline heuristics, it sounds like we should either fix the heuristics or come up with some suggestions about what to avoid in user programs. Or perhaps the issue is something completely unrelated (e.g., a bug in the register allocator). Either way, I think this deserves a ticket.

-Iavor

Well said, Iavor.

It perhaps hints at the register allocators needing some love? I hope to dig deep into those myself later this year, but maybe it needs some wibbles to clean up for 7.8 right now?

On 08/01/14 07:35, Carter Schonwald wrote:
Well said, Iavor. It perhaps hints at the register allocators needing some love? I hope to dig deep into those myself later this year, but maybe it needs some wibbles to clean up for 7.8 right now?
There's a bit of confusion here. Let me try to clarify:

- The graph-colouring register allocator now trips the spill slot limit with SHA-1, where it didn't previously. This may be because earlier compiler stages are generating worse code, or it may be because this allocator has bitrotted (see #7679).
- The code compiles fine without the flag -fregs-graph.
- The limitation on spill slots that existed in all versions prior to 7.8 has been lifted in 7.8, but only for the linear register allocator (the default one that you get without -fregs-graph).

So, let's just disable -fregs-graph in 7.8.1. Ben is right that avoiding -fregs-graph doesn't really fix the problem, because we'll probably get crappy code for SHA-1 now. But someone needs to work on -fregs-graph.

This ticket is for the performance issue: https://ghc.haskell.org/trac/ghc/ticket/7679
And I just created this one for the spill slot issue: https://ghc.haskell.org/trac/ghc/ticket/8657

Cheers,
Simon

On 08/01/2014, at 21:37 , Simon Marlow
Ben is right that avoiding -fregs-graph doesn't really fix the problem, because we'll probably get crappy code for SHA-1 now. But someone needs to work on -fregs-graph.
I wouldn't be offended if you just deleted the code from the repo. Now that the LLVM project is so well supported I don't see much point maintaining a second GHC-specific allocator. Ben.

Just a guess, but I imagine it works for you because on amd64 the NCG will have more registers available to satisfy all the live ranges. On 32-bit the situation is much worse, because that's just how x86 is.

In any case, I'm inclined to also agree this is a bug, and merits an open ticket. And it's not like there is no precedent for these things: plenty of libraries (vector-algorithms & others) stress the simplifier, e.g. exhausting the simplifier ticks, and we tend to try and work towards fixing these.

While INLINE is crucial in vector-algorithms, it probably isn't for the SHA package, and based on what Adam said I do think this perhaps merits some more investigation. I doubt it will get 'fixed' properly in time for 7.8, though - a workaround or something will probably be needed for 32-bit builds in some way (perhaps just a few careful refactorings with uses of NOINLINE, although I can't estimate the performance impact).
--
Regards,
Austin Seipp, Haskell Consultant
Well-Typed LLP, http://www.well-typed.com/

One approach for the SHA near-term fix is to add some CPP such that certain inlines are suppressed on x86_32, perhaps?
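Roughly like this, perhaps; i386_HOST_ARCH is, as far as I know, the macro GHC's CPP defines on 32-bit x86, and shaStep is a hypothetical stand-in:

    {-# LANGUAGE CPP #-}
    module CppWorkaround where

    -- Sketch: drop the forced inlining of the step functions on 32-bit
    -- x86, where the register pressure currently breaks the allocator.
    #if defined(i386_HOST_ARCH)
    {-# NOINLINE shaStep #-}
    #else
    {-# INLINE shaStep #-}
    #endif
    shaStep :: Int -> Int -> Int
    shaStep a b = a + b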

Any time GHC simply falls over, it's a bug. Even if you give GHC a 100 Gbyte input program and try to compile it on a 1 Gbyte machine, it's a bug if GHC simply falls over with "heap exhausted". It would be better if it chopped the program into pieces and compiled them one at a time; or failed with a civilised message like "I'm afraid this input program is too big for me to compile".
But some bugs are more pressing than others, and with slender resources we have to concentrate on the ones that affect most people most seriously. There seems to be consensus that this one falls into the "does not affect many people" category, and it has a workaround. So I'm fine with leaving it open, but I think the priority is probably low.
Simon

| Note that removing the flag isn't a "solution" to the underlying problem
| of the intermediate code being awful. Switching to the linear allocator
| just permits compilation of core code that was worse than before. Now it
| needs to spill more registers when compiling the same source code.
In what way is the intermediate code awful? How could it be fixed?
Worth opening a ticket for that issue? At the moment it's invisible because the issue appears superficially to be about register allocation.
Simon

On 06/01/2014, at 19:43, Simon Peyton-Jones wrote:
In what way is the intermediate code awful?
Because the error message from the register allocator tells us that there are over 1000 live variables at a particular point in the assembly code, but the "biggest" SHA hashing algorithm (SHA-3) should only need to maintain 25 words of state (says Wikipedia).
How could it be fixed?
Someone that cares enough about the SHA library would need to understand why it's producing the intermediate code it does. My gentle suggestion is that when a library developer starts adding INLINE pragmas to their program it becomes their job to understand why the intermediate code is how it is.
Worth opening a ticket for that issue? At the moment it's invisible because the issue appears superficially to be about register allocation.
I'd open a ticket against the SHA library saying the choice of optimisation flags / pragmas is probably causing code explosion during compilation. If the developer then decides this is really a problem in GHC I'd want some description of what core transforms they need to happen to achieve good performance. The strategy of "inline everything and hope for the best" is understandable (I've used it!) but only gets you so far... The bug report is like someone saying "GHC can't compile my 100MB core program". You can either open a ticket against GHC, or ask "why have you got a 100MB core program?" Ben.

On 07/01/14 09:59, Ben Lippmeier wrote:
On 06/01/2014, at 19:43 , Simon Peyton-Jones wrote:
| Note that removing the flag isn't a "solution" to the underlying problem
| of the intermediate code being awful. Switching to the linear allocator
| just permits compilation of core code that was worse than before. Now it
| needs to spill more registers when compiling the same source code.
In what way is the intermediate code awful?
Because the error message from the register allocator tells us that there are over 1000 live variables at a particular point in the assembly code, but the "biggest" SHA hashing algorithm (SHA-3) should only need to maintain 25 words of state (says Wikipedia).
Neither of the register allocators reuse spill slots for variables that have disjoint live ranges, so the fact that we ran out of spill slots is not necessarily indicative of terrible code (but I agree that it's a strong hint). Cheers, Simon
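To make "disjoint live ranges" concrete, a tiny invented example (nothing to do with the SHA code itself):

    f :: Int -> Int
    f n =
      let x = n * 3     -- x becomes live here
          a = x + 1     -- ...and dies here, at its last use
          y = a * 7     -- y is born only after x is dead
      in y + a
    -- If both x and y get spilled, a slot-reusing allocator could
    -- give them the same stack slot, since they are never live at
    -- the same time; at this point GHC's allocators gave each
    -- spilled variable its own slot.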

On Jan 8, 2014, at 2:42 AM, Simon Marlow wrote:
Neither of the register allocators reuse spill slots for variables that have disjoint live ranges, so the fact that we ran out of spill slots is not necessarily indicative of terrible code (but I agree that it's a strong hint).
That’s the problem with SHA, then. The implementation (and the spec, really) is essentially a long combination of the form:
    let x_n5 = small_computation x_n1 x_n2 x_n3 x_n4
        x_n6 = small_computation x_n2 x_n3 x_n4 x_n5
        …
which has ~70 entries. The actual number of variables live at any time should be relatively small, but if slots aren’t getting reused there’s going to be some significant blowup. (To be honest, I had figured — and thought I had validated — that doing it this way would give the compiler the best chance at generating optimal code, but it appears I merely set myself up to hit this limitation several years later.)
- Adam
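For readers following along, a self-contained sketch of the shape being described; the mixing function is invented, not the library's real one:

    import Data.Bits (rotate, xor, (.&.))
    import Data.Word (Word32)

    -- A stand-in for the per-round computation; the real SHA step
    -- differs, this only mimics its shape.
    smallComputation :: Word32 -> Word32 -> Word32 -> Word32 -> Word32
    smallComputation a b c d = ((a `rotate` 5) `xor` (b .&. c)) + d

    -- Three steps of the ~70-step unrolled chain: each x_i is only
    -- needed by the next few steps, so few variables are live at
    -- once; but without spill-slot reuse, every spilled x_i still
    -- claims its own slot.
    rounds :: Word32 -> Word32 -> Word32 -> Word32 -> Word32
    rounds x1 x2 x3 x4 =
      let x5 = smallComputation x1 x2 x3 x4
          x6 = smallComputation x2 x3 x4 x5
          x7 = smallComputation x3 x4 x5 x6
      in x7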

Does LLVM have the same limitation that its register allocator does not
reuse spill slots for variables that have disjoint live ranges? If not,
could the library be compiled with LLVM?
On Thu, Jan 9, 2014 at 3:17 PM, Adam Wick wrote:
On Jan 8, 2014, at 2:42 AM, Simon Marlow wrote:
Neither of the register allocators reuse spill slots for variables that have disjoint live ranges, so the fact that we ran out of spill slots is not necessarily indicative of terrible code (but I agree that it's a strong hint).
That’s the problem with SHA, then. The implementation (and the spec, really) is essentially a long combination of the form:
    let x_n5 = small_computation x_n1 x_n2 x_n3 x_n4
        x_n6 = small_computation x_n2 x_n3 x_n4 x_n5
        …
which has ~70 entries. The actual number of variables live at any time should be relatively small, but if slots aren’t getting reused there’s going to be some significant blowup. (To be honest, I had figured — and thought I had validated — that doing it this way would give the compiler the best chance at generating optimal code, but it appears I merely set myself up to hit this limitation several years later.)
- Adam

George, the problem with that is that not all targets GHC supports have an
LLVM backend. (In fact, one problem I hope to help resolve (not in full,
mind you) for 7.10 is that there's semi-disjoint coverage across the
backends.)
On Thu, Jan 9, 2014 at 3:08 PM, George Colpitts wrote:
Does LLVM have the same limitation that its register allocator does not reuse spill slots for variables that have disjoint live ranges? If not, could the library be compiled with LLVM?
On Thu, Jan 9, 2014 at 3:17 PM, Adam Wick wrote:
On Jan 8, 2014, at 2:42 AM, Simon Marlow wrote:
Neither of the register allocators reuse spill slots for variables that have disjoint live ranges, so the fact that we ran out of spill slots is not necessarily indicative of terrible code (but I agree that it's a strong hint).
That’s the problem with SHA, then. The implementation (and the spec, really) is essentially a long combination of the form:
    let x_n5 = small_computation x_n1 x_n2 x_n3 x_n4
        x_n6 = small_computation x_n2 x_n3 x_n4 x_n5
        …
which has ~70 entries. The actual number of variables live at any time should be relatively small, but if slots aren’t getting reused there’s going to be some significant blowup. (To be honest, I had figured — and thought I had validated — that doing it this way would give the compiler the best chance at generating optimal code, but it appears I merely set myself up to hit this limitation several years later.)
- Adam

On 09/01/14 20:08, George Colpitts wrote:
Does LLVM have the same limitation that its register allocator does not reuse spill slots for variables that have disjoint live ranges? If not, could the library be compiled with LLVM?
Let me reiterate: the SHA-1 library compiles just fine, provided the -fregs-graph flag is not used with GHC 7.8.1. As far as I know it compiles when using the LLVM backend too, but you don't have to use LLVM: the NCG works fine. Cheers, Simon
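For anyone who just wants the workaround when installing, something along these lines should do it (assuming your cabal-install accepts --ghc-options and appends user flags after the package's own, so the later flag wins):

    cabal install SHA --ghc-options="-fno-regs-graph"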
On Thu, Jan 9, 2014 at 3:17 PM, Adam Wick <awick@galois.com> wrote:
On Jan 8, 2014, at 2:42 AM, Simon Marlow <marlowsd@gmail.com> wrote:
Neither of the register allocators reuse spill slots for variables that have disjoint live ranges, so the fact that we ran out of spill slots is not necessarily indicative of terrible code (but I agree that it's a strong hint).
That’s the problem with SHA, then. The implementation (and the spec, really) is essentially a long combination of the form:
    let x_n5 = small_computation x_n1 x_n2 x_n3 x_n4
        x_n6 = small_computation x_n2 x_n3 x_n4 x_n5
        …
which has ~70 entries. The actual number of variables live at any time should be relatively small, but if slots aren’t getting reused there’s going to be some significant blowup. (To be honest, I had figured — and thought I had validated — that doing it this way would give the compiler the best chance at generating optimal code, but it appears I merely set myself up to hit this limitation several years later.)
- Adam

On 10/01/2014, at 6:17 , Adam Wick wrote:
On Jan 8, 2014, at 2:42 AM, Simon Marlow wrote:
Neither of the register allocators reuse spill slots for variables that have disjoint live ranges, so the fact that we ran out of spill slots is not necessarily indicative of terrible code (but I agree that it's a strong hint).
Right, I'm starting to remember more now -- it was a while ago. There are some notes here under the section "SHA from darcs", which I assume is the same code. https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Backends/NCG/Regis... The notes say the Cmm code had 30 or so live variables at a particular point, but live ranges of 1700 instructions. I remember I had to change the heuristic that chooses which register to spill, to select the one with the longest live range -- instead of using the standard one from Chaitin's algorithm. With the standard heuristic the allocator was taking too long to converge.
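A toy model of the heuristic change described above; this is a sketch for intuition, not GHC's actual allocator code:

    import Data.List (maximumBy)
    import Data.Ord (comparing)

    type VReg = Int

    -- Given spill candidates paired with their live-range lengths
    -- (assumed non-empty), pick the longest-lived one instead of
    -- Chaitin's usual spill-cost / degree choice.
    chooseSpill :: [(VReg, Int)] -> VReg
    chooseSpill = fst . maximumBy (comparing snd)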
That’s the problem with SHA, then. The implementation (and the spec, really) is essentially a long combination of the form:
    let x_n5 = small_computation x_n1 x_n2 x_n3 x_n4
        x_n6 = small_computation x_n2 x_n3 x_n4 x_n5
        …
which has ~70 entries. The actual number of variables live at any time should be relatively small, but if slots aren’t getting reused there’s going to be some significant blowup. (To be honest, I had figured — and thought I had validated — that doing it this way would give the compiler the best chance at generating optimal code, but it appears I merely set myself up to hit this limitation several years later.)
If you really end up with 70 copies of small_computation in the object code then that's not very friendly to the L1 instruction cache -- though perhaps it doesn't matter if the processor will be stalled reading the input data most of the time anyway. The -O2 heuristics might inline small_computation by default, assuming it's non-recursive. Ben.

On 10/01/2014, at 6:17 , Adam Wick wrote:
That’s the problem with SHA, then. The implementation (and the spec, really) is essentially a long combination of the form:
    let x_n5 = small_computation x_n1 x_n2 x_n3 x_n4
        x_n6 = small_computation x_n2 x_n3 x_n4 x_n5
        …
which has ~70 entries. The actual number of variables live at any time should be relatively small, but if slots aren’t getting reused there’s going to be some significant blowup. (To be honest, I had figured — and thought I had validated — that doing it this way would give the compiler the best chance at generating optimal code, but it appears I merely set myself up to hit this limitation several years later.)
If this [1] is the current version then I don't think there is any performance reason to manually unroll the loops like that. If you write a standard tail-recursive loop then the branch predictor in the processor should make the correct prediction for all iterations except the last one. You'll get one pipeline stall at the end due to a mis-predicted backward branch, but it won't matter in terms of absolute percentage of execution time. You generally only need to worry about branches if the branch flag flips between True and False frequently. If you care deeply about performance then on some processors it can be helpful to unfold this sort of code so that the SHA constants are represented as literals in the instruction stream instead of in static data memory -- but that ability is very processor specific and you'd need to really stare at the emitted assembly code to see if it's worthwhile. Ben. [1] https://github.com/GaloisInc/SHA/blob/master/Data/Digest/Pure/SHA.hs
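To illustrate the rolled-up alternative (round constants and mixing function invented for the sketch):

    import Data.Bits (rotate, xor)
    import Data.Word (Word32)

    -- Hypothetical round constants; the real SHA tables differ.
    ks :: [Word32]
    ks = take 64 (iterate (+ 0x9e3779b9) 0x5a827999)

    -- Made-up mixing function, for shape only.
    step :: Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> Word32
    step k a b c d = ((a `rotate` 5) `xor` b) + c + d + k

    -- One tail-recursive loop over the constants instead of ~70
    -- manually unrolled bindings; the backward branch predicts
    -- correctly on every iteration except the last.
    rounds :: (Word32, Word32, Word32, Word32) -> (Word32, Word32, Word32, Word32)
    rounds st0 = go ks st0
      where
        go []         st           = st
        go (k : rest) (a, b, c, d) = go rest (b, c, d, step k a b c d)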

Agreed,
this level of unrolling would cause problems even in most C compilers! When
I write unrolled SIMD C code, I use a fixed number of variables that
corresponds to the # of registers that are live on my target arch (yes,
internally they turn into SSA, which then does an ANF/CPS style and such),
but by shadowing/reusing a fixed number of names, I can make it
"syntactically" clear what the lifetimes of my variables are.
But yeah, branch predictors are pretty good on modern hardware, a loop is
worth considering.
On Sun, Jan 12, 2014 at 10:06 PM, Ben Lippmeier wrote:
On 10/01/2014, at 6:17 , Adam Wick wrote:
That’s the problem with SHA, then. The implementation (and the spec, really) is essentially a long combination of the form:
    let x_n5 = small_computation x_n1 x_n2 x_n3 x_n4
        x_n6 = small_computation x_n2 x_n3 x_n4 x_n5
        …
which has ~70 entries. The actual number of variables live at any time should be relatively small, but if slots aren’t getting reused there’s going to be some significant blowup. (To be honest, I had figured — and thought I had validated — that doing it this way would give the compiler the best chance at generating optimal code, but it appears I merely set myself up to hit this limitation several years later.)
If this [1] is the current version then I don't think there is any performance reason to manually unroll the loops like that. If you write a standard tail-recursive loop then the branch predictor in the processor should make the correct prediction for all iterations except the last one. You'll get one pipeline stall at the end due to a mis-predicted backward branch, but it won't matter in terms of absolute percentage of execution time. You generally only need to worry about branches if the branch flag flips between True and False frequently.
If you care deeply about performance then on some processors it can be helpful to unfold this sort of code so that the SHA constants are represented as literals in the instruction stream instead of in static data memory -- but that ability is very processor specific and you'd need to really stare at the emitted assembly code to see if it's worthwhile.
Ben.
[1] https://github.com/GaloisInc/SHA/blob/master/Data/Digest/Pure/SHA.hs
participants (12)

- Adam Wick
- Austin Seipp
- Ben Lippmeier
- Carter Schonwald
- George Colpitts
- Iavor Diatchki
- Jan Stolarek
- Kazu Yamamoto
- Manuel Gómez
- Simon Marlow
- Simon Peyton Jones
- Simon Peyton-Jones