[GHC] #8279: bad alignment in code gen yields substantial perf issue

#8279: bad alignment in code gen yields substantial perf issue
------------------------------+--------------------------------------------
       Reporter:  carter      |             Owner:
           Type:  bug         |            Status:  new
       Priority:  highest     |         Milestone:
      Component:  Compiler    |           Version:  7.7
       Keywords:              |  Operating System:  Unknown/Multiple
   Architecture:              |   Type of failure:  Runtime performance bug
     Unknown/Multiple         |         Test Case:
     Difficulty:  Unknown     |          Blocking:
     Blocked By:              |   Related Tickets:
------------------------------+--------------------------------------------

Independently, several people have noticed that GHC currently has a number of memory-alignment-related performance problems, which can have a >= 10% performance impact!

Nicolas Frisby notes:

{{{
On my laptop, a program showed a consistent slowdown with -fdicts-strict.

I didn't find any obvious causes in the Core differences, so I turned to Intel's Performance Counter Monitor for measurements. After trying a few counters, I eventually saw that there are about an order of magnitude more misaligned memory loads with -fdicts-strict than without, so I think that may be a significant part of the slowdown. I'm not sure if these are code or data reads.

Can anyone suggest how to validate this hypothesis about misaligned reads?

A subsequent commit has changed the behavior I was seeing, so I'm not interested in alternative means to determine if -fdicts-strict is somehow at fault; I'm just asking specifically about data/code memory alignment in GHC and how to diagnose/experiment with it.
}}}

Reid Barton has independently noted:

{{{
so I did a nofib run with llvm libraries, ghc quickbuild
so there's this really simple benchmark tak, https://github.com/ghc/nofib/blob/master/imaginary/tak/Main.hs
it doesn't use any libraries at all in the main loop because the Ints all get unboxed
but it's still 8% slower with quick-llvm (vs -fasm)
weird right?
[14:36:30] <carter> could you post the asm it generates for that function?
[14:36:49] <rwbarton> well it's identical between the two versions
<rwbarton> but they get linked at different offsets because some llvm sections are different sizes
<rwbarton> if I add a 128-byte symbol to the .text section to move it to the same address... then the llvm libs version is just as fast
<rwbarton> well, apparently 404000 is good and 403f70 is bad
<rwbarton> I guess I can test other alignments easily enough
<rwbarton> I imagine it wants to start on a cache line
<rwbarton> but I don't know if it's just a coincidence that it worked with the ncg libraries
<rwbarton> that it got a good location
<rwbarton> for this program every 32-byte aligned address is 10+% faster than any merely 16-byte aligned address
<rwbarton> and by alignment I mean alignment of the start of the Haskell code section
<carter> haswell, sandybridge, ivy bridge, other?
<rwbarton> dunno
<rwbarton> I have similar results on Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
<rwbarton> and on Quad-Core AMD Opteron(tm) Processor 2374 HE
<carter> ok
<rwbarton> trying a patch now that aligns all *_entry symbols to 32 bytes
}}}

The key point in there is that better alignment of the code made a 10% performance difference on the tak benchmark, on both Core 2 and Opteron CPUs!

Benjamin Scarlet and Luite speculate that this may be made worse by tables next to code (TNC) accidentally creating bad alignment, so that there is cache-line pollution / conflict between the L1 instruction cache and the data cache. One experiment would be to have the TNC transform pad after the table so that the function entry point starts on the next cache line.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
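
To make the two addresses from the chat log concrete, here is a minimal Python sketch (not part of the ticket). It assumes that 404000 and 403f70 are hexadecimal addresses; the helper names `largest_alignment` and `is_aligned` are illustrative only.

```python
def largest_alignment(addr: int) -> int:
    """Return the largest power of two that divides a nonzero address."""
    # addr & -addr isolates the lowest set bit of addr.
    return addr & -addr


def is_aligned(addr: int, boundary: int) -> bool:
    """True if addr is a multiple of boundary."""
    return addr % boundary == 0


if __name__ == "__main__":
    # Assumption: the two values quoted in the chat are hex addresses.
    for addr in (0x404000, 0x403F70):
        print(
            hex(addr),
            "largest alignment:", largest_alignment(addr),
            "16-byte aligned:", is_aligned(addr, 16),
            "32-byte aligned:", is_aligned(addr, 32),
        )
```

Under that reading, the "bad" address 0x403f70 is 16-byte aligned but not 32-byte aligned, while the "good" address 0x404000 is aligned to 16384 bytes, which is consistent with the later remark that 32-byte-aligned starts were faster than merely 16-byte-aligned ones.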

Comment (by ezyang):

The bad cache behavior of tables next to code has been observed in http://njn.valgrind.org/pubs/cache-large-lazy2002.ps (some of the discussion there is out-of-date, as it predates dynamic pointer tagging).

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:1

Comment (by carter):

Nathan Howell notes that it's OK for the L1 instruction and data caches to hold the same cache lines, so the issue is probably just a matter of having good alignment.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:2

Comment (by carter):

Anyway: any exploration of this will require thorough benchmarking with nofib, ideally across a range of CPU variants / microarchitectures.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:3

Comment (by rwbarton):

The "align *_entry to 32 bytes" patch helped tak a lot (~10%), but made no noticeable difference on average over nofib (lots of random changes roughly in the range -3% to 3%).

I don't really understand how shifting the address of code by 16 bytes can have such a drastic effect on performance. I guess it must have to do with cache lines, but is using one more cache line really so awful?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:4

Comment (by carter):

@rwbarton, good question! I think we'll just have to set up some more systematic experiments, and try to get good measurements on a variety of recent CPU microarchitectures, to understand this better. Hopefully something we can explore once 7.9 dev gets afoot!

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:5

Comment (by simonmar):

I've noticed things like this in the past. One hypothesis is that when the code is tightly packed together the branch predictor doesn't work so well, but that's just a guess.

The way forward is to do systematic measurements with all of nofib, measuring performance counters and code size with various alignment strategies. Also someone could thoroughly read the latest Intel optimization guides and see if we're doing the right things.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:6

Changes (by jstolarek):

 * related: => #8082

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:7

Changes (by bgamari):

 * cc: bgamari@… (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:8

Comment (by schyler):

A little research uncovers that Intel recommends aligning code and branch targets on 16-byte boundaries:

`3.4.1.5 - Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.`

The reasons for this are as follows:

 * Aligning to a 16-byte boundary makes it more likely that a loop will fall inside a single cache line rather than across two. A loop that straddles a cache-line boundary has a really noticeable negative performance impact.
 * Small forward relative jumps that stay inside the same cache line are handled more efficiently by the pipeline than far jumps.

Note that we have to be really careful not to turn a small relative jump into a larger jump by altering alignment slightly, because these tend to be slower (see the second point).

Tail calls may benefit from aligning both the top of tables (for aligned reading) and the top of actual function entry points (for aligned iteration) to 16 bytes. This requires experimentation and benchmarking.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:9
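
schyler's first point, that alignment decides whether a loop sits in one cache line or two, can be illustrated with a small Python sketch. It is not from the ticket: the 64-byte cache-line size and the 40-byte loop body are assumptions chosen for illustration, and `lines_spanned` is a hypothetical helper.

```python
def lines_spanned(start: int, size: int, line_size: int = 64) -> int:
    """Count the cache lines touched by `size` bytes of code at address `start`.

    line_size defaults to 64 bytes, which is an assumption about the CPU.
    """
    if size <= 0:
        return 0
    first_line = start // line_size
    last_line = (start + size - 1) // line_size
    return last_line - first_line + 1


if __name__ == "__main__":
    loop_size = 40  # hypothetical loop body size in bytes
    # Every start offset here is 16-byte aligned, yet the results differ.
    for offset in (0, 16, 32, 48):
        print(f"offset {offset:2d}: {lines_spanned(offset, loop_size)} cache line(s)")
```

In this model a 16-byte-aligned start does not guarantee a single cache line (offsets 32 and 48 still straddle a boundary); it only rules out the odd starting offsets, which fits the "more likely" wording above.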

Changes (by schyler):

 * cc: schyler (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:10

Comment (by simonmar):

So the problem with 16-byte aligning branch targets is that many of our code blocks have a 3-word info table. We would have to pad these info tables by one word in addition to aligning to 16 bytes. That might not be too bad, but someone needs to do the measurements to see what the code size / speed tradeoff is. Also we don't necessarily want to align all our labels, because many of them are just heap-check failure targets and wouldn't benefit from aligning at all.

I tend to optimise for small binary sizes because I was brought up on computers with 32K of memory and I think you should never waste a byte :-) If you think binary size can be won elsewhere, please do it :-P

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:11
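
The "pad by one word" arithmetic can be checked with a short Python sketch. This is a simplified model rather than GHC's actual layout code: it assumes a 64-bit target (8-byte words) and an info table that starts on an aligned address and is immediately followed by the entry code. `padding_words` is a hypothetical helper.

```python
WORD_SIZE = 8  # bytes per word; assumes a 64-bit target


def padding_words(info_table_words: int, alignment: int = 16,
                  word_size: int = WORD_SIZE) -> int:
    """Words of padding needed so that the code following an info table
    (which itself starts on an `alignment`-byte boundary) is also aligned."""
    table_bytes = info_table_words * word_size
    # Distance from the end of the table to the next aligned boundary.
    pad_bytes = -table_bytes % alignment
    return pad_bytes // word_size


if __name__ == "__main__":
    for words in (2, 3, 4, 5):
        print(f"{words}-word info table: pad {padding_words(words)} word(s)")
```

For a 3-word (24-byte) table this gives one word of padding, matching the comment above; the same model gives one word for a 32-byte target alignment as well, at the cost of growing every such table from 24 to 32 bytes.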

Priority: high | Milestone: 7.10.1

Changes (by hvr):

 * cc: hvr (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:13

Comment (by George):

Is this true for both the new code generator and llvm?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:14

Changes (by George):

 * cc: george.colpitts@… (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:15

Description changed by jstolarek:

The only visible change is a typo fix in Nicolas Frisby's quoted message ("eventuall" became "eventually"); the old and new descriptions are otherwise identical to the original ticket description.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:16

Comment (by simonpj):

If we knew that better alignment would improve speed at the cost of binary size (something that might differ across different architectures), that would be more motivation for a flag `-fchoose-speed-over-binary-size`.

Another project for someone!

Simon

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:17

Changes (by archblob):

 * cc: archblob (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:18

Changes (by carter):

 * priority: high => normal
 * milestone: 7.10.1 => 7.12.1

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:19

Changes (by gidyn):

 * cc: gideon@… (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:20

Changes (by gidyn):

 * cc: gideon@… (removed)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:21

Changes (by gidyn):

 * cc: gidyn (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:22

Milestone: 8.0.1

Comment (by rwbarton):

Some related reading material:

 * "Producing Wrong Data Without Doing Anything Obviously Wrong!", http://sape.inf.usi.ch/publications/asplos09
 * Stabilizer, http://plasma.cs.umass.edu/emery/stabilizer
 * MAO - an Extensible Micro-Architectural Optimizer, http://research.google.com/pubs/pub37077.html

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:24

Comment (by George):

Re Simon PJ's comment 17: would it be relatively easy to do this with the LLVM compiler? AFAIK opt and llc don't seem to have the appropriate options, but clang has -Os and -Oz (see [https://clang.llvm.org/docs/CommandGuide/clang.html]), so I would hope it wouldn't be too hard to do in the Haskell LLVM compiler.

Also, with respect to LLVM, there is a mention of alignment here: [http://llvm.org/docs/Frontend/PerformanceTips.html]. I'm sure most people interested in the LLVM compiler are aware of this doc, but I thought I may as well mention it to be complete.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:26

Comment (by sgraf):

I hit this today in `CSD` (kernel is basically a counting loop), with a delta of 12%. Supplying `-fllvm` swapped symptoms for me, still a delta of ~-8%.

This was 1a0a971b76c0b717794af9af4e27dcb488924800 vs https://github.com/sgraf812/ghc/tree/25bba2b92b9bcdc090ae9418a0425ea0a829491....

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:27

Changes (by sgraf):

 * cc: sgraf (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:28

Comment (by George):

Re Simon PJ's and Simon Marlow's comments: we don't quite know that better alignment would improve speed at the cost of binary size, but the Intel documentation is a pretty strong hint that it probably will, and it is thus worth checking out, right?

Couldn't we break this into two tasks: first, provide a patch with instructions on how to apply it (e.g. to 8.6.1 or HEAD), and second, benchmark the resulting compiler? The benchmarking task could be done by people with much less, or at least different, expertise than it takes to produce such a patch, right?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8279#comment:29