[GHC] #8900: unordered-containers 19% slower in HEAD vs 7.6.3

#8900: unordered-containers 19% slower in HEAD vs 7.6.3 ------------------------------+-------------------------------------------- Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Keywords: | Operating System: MacOS X Architecture: x86_64 | Type of failure: Runtime performance bug (amd64) | Test Case: Difficulty: Unknown | Blocking: Blocked By: | Related Tickets: | ------------------------------+-------------------------------------------- I ran a simple benchmark that exercises `Data.HashMap.Lazy.insert`. It's 19% slower using HEAD compared to using 7.6.3. The generated Core is the same, but the new codegen generates substantially different Cmm. '''Steps to reproduce''' 1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD: {{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}} 3. Compile the benchmark with both compilers: {{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}} '''Results (best of 3 runs)''' 7.6.3 {{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed) %GC time 34.8% (40.3% elapsed) Alloc rate 4,923,204,681 bytes per MUT second Productivity 65.2% of total user, 59.0% of total elapsed }}} HEAD: {{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed) %GC time 41.6% (47.5% elapsed) Alloc rate 4,738,791,249 bytes per MUT second Productivity 58.3% of total user, 51.9% of total elapsed }}} (Note that this is without the patches in #8885, so they're not the cause.) An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: unordered-containers 16% slower in HEAD vs 7.6.3 --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Description changed by tibbe: Old description:
I ran a simple benchmark that exercises `Data.HashMap.Lazy.insert`. It's 19% slower using HEAD compared to using 7.6.3. The generated Core is the same, but the new codegen generates substantially different Cmm.
'''Steps to reproduce'''
1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD:
{{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}}
3. Compile the benchmark with both compilers:
{{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}}
'''Results (best of 3 runs)'''
7.6.3
{{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed)
%GC time 34.8% (40.3% elapsed)
Alloc rate 4,923,204,681 bytes per MUT second
Productivity 65.2% of total user, 59.0% of total elapsed }}}
HEAD:
{{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed)
%GC time 41.6% (47.5% elapsed)
Alloc rate 4,738,791,249 bytes per MUT second
Productivity 58.3% of total user, 51.9% of total elapsed }}}
(Note that this is without the patches in #8885, so they're not the cause.)
An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related.
New description: I ran a simple benchmark that exercises `Data.HashMap.Lazy.insert`. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is the same, but the new codegen generates substantially different Cmm. '''Steps to reproduce''' 1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD: {{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}} 3. Compile the benchmark with both compilers: {{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}} '''Results (best of 3 runs)''' 7.6.3 {{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed) %GC time 34.8% (40.3% elapsed) Alloc rate 4,923,204,681 bytes per MUT second Productivity 65.2% of total user, 59.0% of total elapsed }}} HEAD: {{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed) %GC time 41.6% (47.5% elapsed) Alloc rate 4,738,791,249 bytes per MUT second Productivity 58.3% of total user, 51.9% of total elapsed }}} (Note that this is without the patches in #8885, so they're not the cause.) An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: unordered-containers 16% slower in HEAD vs 7.6.3 --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Changes (by tibbe): * version: 7.6.3 => 7.9 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: unordered-containers 16% slower in HEAD vs 7.6.3 --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): I also see some changes in the Core. I've attached both the Core and optimized Cmm output. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: unordered-containers 16% slower in HEAD vs 7.6.3 --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Description changed by tibbe: Old description:
I ran a simple benchmark that exercises `Data.HashMap.Lazy.insert`. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is the same, but the new codegen generates substantially different Cmm.
'''Steps to reproduce'''
1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD:
{{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}}
3. Compile the benchmark with both compilers:
{{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}}
'''Results (best of 3 runs)'''
7.6.3
{{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed)
%GC time 34.8% (40.3% elapsed)
Alloc rate 4,923,204,681 bytes per MUT second
Productivity 65.2% of total user, 59.0% of total elapsed }}}
HEAD:
{{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed)
%GC time 41.6% (47.5% elapsed)
Alloc rate 4,738,791,249 bytes per MUT second
Productivity 58.3% of total user, 51.9% of total elapsed }}}
(Note that this is without the patches in #8885, so they're not the cause.)
An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related.
New description: I ran a simple benchmark that exercises [https://github.com/tibbe /unordered-containers/blob/master/Data/HashMap/Base.hs#L303 Data.HashMap.Lazy.insert]. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is the same, but the new codegen generates substantially different Cmm. '''Steps to reproduce''' 1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD: {{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}} 3. Compile the benchmark with both compilers: {{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}} '''Results (best of 3 runs)''' 7.6.3 {{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed) %GC time 34.8% (40.3% elapsed) Alloc rate 4,923,204,681 bytes per MUT second Productivity 65.2% of total user, 59.0% of total elapsed }}} HEAD: {{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed) %GC time 41.6% (47.5% elapsed) Alloc rate 4,738,791,249 bytes per MUT second Productivity 58.3% of total user, 51.9% of total elapsed }}} (Note that this is without the patches in #8885, so they're not the cause.) An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: unordered-containers 16% slower in HEAD vs 7.6.3 --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Description changed by tibbe: Old description:
I ran a simple benchmark that exercises [https://github.com/tibbe /unordered-containers/blob/master/Data/HashMap/Base.hs#L303 Data.HashMap.Lazy.insert]. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is the same, but the new codegen generates substantially different Cmm.
'''Steps to reproduce'''
1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD:
{{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}}
3. Compile the benchmark with both compilers:
{{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}}
'''Results (best of 3 runs)'''
7.6.3
{{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed)
%GC time 34.8% (40.3% elapsed)
Alloc rate 4,923,204,681 bytes per MUT second
Productivity 65.2% of total user, 59.0% of total elapsed }}}
HEAD:
{{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed)
%GC time 41.6% (47.5% elapsed)
Alloc rate 4,738,791,249 bytes per MUT second
Productivity 58.3% of total user, 51.9% of total elapsed }}}
(Note that this is without the patches in #8885, so they're not the cause.)
An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related.
New description: I ran a simple benchmark that exercises [https://github.com/tibbe /unordered-containers/blob/master/Data/HashMap/Base.hs#L303 Data.HashMap.Lazy.insert]. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is a bit different and the generated Cmm is quite a bit different. '''Steps to reproduce''' 1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD: {{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}} 3. Compile the benchmark with both compilers: {{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}} '''Results (best of 3 runs)''' 7.6.3 {{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed) %GC time 34.8% (40.3% elapsed) Alloc rate 4,923,204,681 bytes per MUT second Productivity 65.2% of total user, 59.0% of total elapsed }}} HEAD: {{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed) %GC time 41.6% (47.5% elapsed) Alloc rate 4,738,791,249 bytes per MUT second Productivity 58.3% of total user, 51.9% of total elapsed }}} (Note that this is without the patches in #8885, so they're not the cause.) An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: unordered-containers 16% slower in HEAD vs 7.6.3 --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonmar): So we have {{{ MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) vs MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) }}} ie. almost all the difference is in GC. And: {{{ 141,978,520 bytes copied during GC 99 MB total memory in use (0 MB lost due to fragmentation) vs 231,158,688 bytes copied during GC 144 MB total memory in use (0 MB lost due to fragmentation) }}} And {{{ Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s vs Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s }}} My guess, based on seeing things like this before, is that the benchmark is very sensitive to when exactly major GC strikes - perhaps it has a spike in memory usage at some point. You ought to be able to test this hypothesis by tweaking the GC options. Try with a very large heap size (-A2G). There's a very small change in the MUT time, which is probably accounted for by extra cache misses caused by the extra GC activity. So I suspect this is nothing to worry about. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: unordered-containers 16% slower in HEAD vs 7.6.3 --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): I've found a likely culprit in the generated Core. Compare the case for `Full` in `$wpoly_go` for [https://ghc.haskell.org/trac/ghc/attachment/ticket/8900/HashMapInsert-7.6.3 .dump-simpl#L569 7.6.3]: {{{ 569 case indexArray# rb_a1cI i#_s1oq of _ { (# ipv2_a1cS #) -> 570 case $wpoly_go ww_s2HT ww1_s2HX w_s2HZ (+# ww2_s2I2 4) ipv2_a1cS 571 of st'_a1cV { __DEFAULT -> }}} vs the case for `Full` in `$wpoly_go` for [https://ghc.haskell.org/trac/ghc/attachment/ticket/8900/HashMapInsert- HEAD.dump-simpl#L554 HEAD]: {{{ 554 case indexArray# dt_a2Ly i#_a2LH 555 of _ [Occ=Dead] { (# ipv2_a2LM #) -> 556 case ipv2_a2LM of st_a2LO { __DEFAULT -> 557 case $wpoly_go ww_s4A3 ww1_s4A7 w_s4zY (+# ww2_s4Ab 4) st_a2LO 558 of st'_a2LP { __DEFAULT -> }}} Both cases correspond to this snippet in `Data.HashMap.Base`: {{{ #!haskell go h k x s t@(Full ary) = let !st = A.index ary i !st' = go h k x (s+bitsPerSubkey) st }}} Here's the definition for `A.index`: {{{ #!haskell index :: Array a -> Int -> a index ary _i@(I# i#) = case indexArray# (unArray ary) i# of (# b #) -> b }}} There's an extra `case` in the HEAD version, which translates into an extra tag bits check in the Cmm. This happens in a couple of places in the core. This `case` is unnecessary since `$wpoly_go` scrutinizes `st_a2LO` immediately. This looks like a regression in strictness analysis. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: unordered-containers 16% slower in HEAD vs 7.6.3 --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): While the extra `case` is definitely a regression, it doesn't seem to be the cause of the time difference. Changing `A.index` to: {{{ #!haskell index :: Array a -> Int -> (# a #) index ary _i@(I# i#) = indexArray# (unArray ary) i# }}} and the snippet from `Data.HashMap.Base` to: {{{ #!haskell go h k x s t@(Full ary) = let !(# st #) = A.index ary i !st' = go h k x (s+bitsPerSubkey) st }}} makes the difference in the Core go away, but the time difference remains. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: unordered-containers 16% slower in HEAD vs 7.6.3 --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): Here are the numbers with `-A2G`: 7.6.3: {{{ $ ./HashMapInsert +RTS -s -A2G 1,191,223,528 bytes allocated in the heap 3,400 bytes copied during GC 36,080 bytes maximum residency (1 sample(s)) 13,072 bytes maximum slop 2081 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.01s 0.01s 0.0064s 0.0064s INIT time 0.01s ( 0.02s elapsed) MUT time 0.58s ( 0.93s elapsed) GC time 0.01s ( 0.01s elapsed) EXIT time 0.01s ( 0.01s elapsed) Total time 0.60s ( 0.96s elapsed) %GC time 1.0% (0.7% elapsed) Alloc rate 2,063,221,325 bytes per MUT second Productivity 97.6% of total user, 61.0% of total elapsed }}} HEAD: {{{ $ ./HashMapInsert +RTS -s -A2G 1,191,223,096 bytes allocated in the heap 3,312 bytes copied during GC 35,992 bytes maximum residency (1 sample(s)) 13,160 bytes maximum slop 2081 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.01s 0.02s 0.0158s 0.0158s INIT time 0.01s ( 0.02s elapsed) MUT time 0.60s ( 0.93s elapsed) GC time 0.01s ( 0.02s elapsed) EXIT time 0.01s ( 0.02s elapsed) Total time 0.62s ( 0.99s elapsed) %GC time 1.0% (1.6% elapsed) Alloc rate 1,998,679,702 bytes per MUT second Productivity 97.7% of total user, 61.3% of total elapsed }}} I'll accept Simon M's explanation of the difference. I'll leave the ticket open for the strictness analysis issue. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Description changed by tibbe: Old description:
I ran a simple benchmark that exercises [https://github.com/tibbe /unordered-containers/blob/master/Data/HashMap/Base.hs#L303 Data.HashMap.Lazy.insert]. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is a bit different and the generated Cmm is quite a bit different.
'''Steps to reproduce'''
1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD:
{{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}}
3. Compile the benchmark with both compilers:
{{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}}
'''Results (best of 3 runs)'''
7.6.3
{{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed)
%GC time 34.8% (40.3% elapsed)
Alloc rate 4,923,204,681 bytes per MUT second
Productivity 65.2% of total user, 59.0% of total elapsed }}}
HEAD:
{{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed)
%GC time 41.6% (47.5% elapsed)
Alloc rate 4,738,791,249 bytes per MUT second
Productivity 58.3% of total user, 51.9% of total elapsed }}}
(Note that this is without the patches in #8885, so they're not the cause.)
An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related.
New description: Edit: There were two issues discussed here. One is solved. I left the ticket open for the strictness analysis regression part. I ran a simple benchmark that exercises [https://github.com/tibbe /unordered-containers/blob/master/Data/HashMap/Base.hs#L303 Data.HashMap.Lazy.insert]. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is a bit different and the generated Cmm is quite a bit different. '''Steps to reproduce''' 1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD: {{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}} 3. Compile the benchmark with both compilers: {{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}} '''Results (best of 3 runs)''' 7.6.3 {{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed) %GC time 34.8% (40.3% elapsed) Alloc rate 4,923,204,681 bytes per MUT second Productivity 65.2% of total user, 59.0% of total elapsed }}} HEAD: {{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed) %GC time 41.6% (47.5% elapsed) Alloc rate 4,738,791,249 bytes per MUT second Productivity 58.3% of total user, 51.9% of total elapsed }}} (Note that this is without the patches in #8885, so they're not the cause.) An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Changes (by tibbe): * version: 7.9 => 7.8.1-rc2 Comment: Strictness analysis regression is also in the latest 7.8 snapshot I had lying around, ghc-7.8.0.20140228, so it's likely in 7.8 RC. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Description changed by tibbe: Old description:
Edit: There were two issues discussed here. One is solved. I left the ticket open for the strictness analysis regression part.
I ran a simple benchmark that exercises [https://github.com/tibbe /unordered-containers/blob/master/Data/HashMap/Base.hs#L303 Data.HashMap.Lazy.insert]. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is a bit different and the generated Cmm is quite a bit different.
'''Steps to reproduce'''
1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD:
{{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}}
3. Compile the benchmark with both compilers:
{{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}}
'''Results (best of 3 runs)'''
7.6.3
{{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed)
%GC time 34.8% (40.3% elapsed)
Alloc rate 4,923,204,681 bytes per MUT second
Productivity 65.2% of total user, 59.0% of total elapsed }}}
HEAD:
{{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s
INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed)
%GC time 41.6% (47.5% elapsed)
Alloc rate 4,738,791,249 bytes per MUT second
Productivity 58.3% of total user, 51.9% of total elapsed }}}
(Note that this is without the patches in #8885, so they're not the cause.)
An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related.
New description: Edit: There were two issues discussed here. One is solved. I left the ticket open for the strictness analysis regression part. Analysis of strictness regression starts in comment 7 below. I ran a simple benchmark that exercises [https://github.com/tibbe /unordered-containers/blob/master/Data/HashMap/Base.hs#L303 Data.HashMap.Lazy.insert]. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is a bit different and the generated Cmm is quite a bit different. '''Steps to reproduce''' 1. Download the attached `HashMapInsert.hs` benchmark. 2. Install unordered-containers with both 7.6.3 and HEAD: {{{ $ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3 $ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3 }}} 3. Compile the benchmark with both compilers: {{{ $ ghc-7.6.3 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertOld $ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs $ mv HashMapInsert HashMapInsertNew }}} '''Results (best of 3 runs)''' 7.6.3 {{{ $ ./HashMapInsertOld +RTS -s 1,191,223,528 bytes allocated in the heap 141,978,520 bytes copied during GC 37,811,840 bytes maximum residency (8 sample(s)) 22,378,432 bytes maximum slop 99 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2277 colls, 0 par 0.06s 0.06s 0.0000s 0.0002s Gen 1 8 colls, 0 par 0.07s 0.10s 0.0127s 0.0479s INIT time 0.00s ( 0.00s elapsed) MUT time 0.24s ( 0.24s elapsed) GC time 0.13s ( 0.17s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.37s ( 0.41s elapsed) %GC time 34.8% (40.3% elapsed) Alloc rate 4,923,204,681 bytes per MUT second Productivity 65.2% of total user, 59.0% of total elapsed }}} HEAD: {{{ $ ./HashMapInsertNew +RTS -s 1,191,223,128 bytes allocated in the heap 231,158,688 bytes copied during GC 55,533,064 bytes maximum residency (13 sample(s)) 22,378,488 bytes maximum slop 144 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2268 colls, 0 par 0.06s 0.07s 0.0000s 0.0003s Gen 1 13 colls, 0 par 0.12s 0.16s 0.0127s 0.0468s INIT time 0.00s ( 0.00s elapsed) MUT time 0.25s ( 0.25s elapsed) GC time 0.18s ( 0.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 0.43s ( 0.49s elapsed) %GC time 41.6% (47.5% elapsed) Alloc rate 4,738,791,249 bytes per MUT second Productivity 58.3% of total user, 51.9% of total elapsed }}} (Note that this is without the patches in #8885, so they're not the cause.) An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonmar): If that small difference in the MUT time is reliable, there might be a regression in the code generator we need to look into. Were those numbers with or without the extra case expression? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): Without the extra case expression, there's still a small difference. Using the best of 5 runs, here are the numbers: HEAD (46d05ba03d1491cade4a3fe33f0b8c404ad3c760): {{{ $ ./HashMapInsert +RTS -s -A2G 1,191,223,096 bytes allocated in the heap 3,312 bytes copied during GC 35,992 bytes maximum residency (1 sample(s)) 13,160 bytes maximum slop 2081 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.01s 0.01s 0.0060s 0.0060s INIT time 0.01s ( 0.02s elapsed) MUT time 0.60s ( 0.89s elapsed) GC time 0.01s ( 0.01s elapsed) EXIT time 0.01s ( 0.01s elapsed) Total time 0.62s ( 0.92s elapsed) %GC time 0.9% (0.6% elapsed) Alloc rate 1,991,359,179 bytes per MUT second Productivity 97.8% of total user, 66.0% of total elapsed }}} 7.6.3: {{{ $ ./HashMapInsert +RTS -s -A2G 1,191,223,528 bytes allocated in the heap 3,400 bytes copied during GC 36,080 bytes maximum residency (1 sample(s)) 13,072 bytes maximum slop 2081 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.01s 0.01s 0.0057s 0.0057s INIT time 0.01s ( 0.02s elapsed) MUT time 0.59s ( 0.86s elapsed) GC time 0.01s ( 0.01s elapsed) EXIT time 0.01s ( 0.01s elapsed) Total time 0.61s ( 0.90s elapsed) %GC time 0.9% (0.6% elapsed) Alloc rate 2,033,686,150 bytes per MUT second Productivity 97.8% of total user, 66.4% of total elapsed }}} I've attached the Cmm for both HEAD and 7.6.3. They're not trivial to compare as in HEAD everything is in one function, `$wpoly_go_entry`, while in 7.6.3 it's split over two, `$wpoly_go_info` and `s2ZS_ret`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonpj): Johan, the "extra case" comes from commit 28d9a03253e8fd613667526a170b684f2017d299, whose comment says {{{ Make CaseElim a bit less aggressive See Note [Case elimination: lifted case]: We used to do case elimination if (c) the scrutinee is a variable and 'x' is used strictly But that changes case x of { _ -> error "bad" } --> error "bad" which is very puzzling if 'x' is later bound to (error "good"). Where the order of evaluation is specified (via seq or case) we should respect it. }}} And indeed that's a good point. `Note [Case binder next]` in `Simplify.lhs` is relevant here. You point out that `$wpoly_go` is strict in its last arg, so it really is "next". I wonder how picky we want to be. Consider {{{ case x of y -> g (error "uk") y }}} and suppose that g is strict in both arguments. Would it be ok to drop the case? Then we might get `error "uk"` (if `g` evaluated its first arg first) rather than any error arising from `x`? Probably yes, I guess. Indeed maybe the change is just wrong, or at least over-conservative. According to our paper "A semantics of imprecise exceptions" it is ok. I don't think anyone actually reported a bug. So maybe we should revert to the ghc 7.6 behaviour? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): I don't know what the right approach is. If we decide to keep the new behavior I'll have to be more careful about any extra evaluation, as I can't trust GHC to remove it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonmar): I'm in favour of being gung-ho about evaluation order, especially when there's an optimisation to be had. If the user wants to control evaluation order, then `pseq` is the right way to do it, not `seq`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression
--------------------------------------------+------------------------------
Reporter: tibbe | Owner:
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.8.1-rc2
Resolution: | Keywords:
Operating System: MacOS X | Architecture: x86_64
Type of failure: Runtime performance bug | (amd64)
Test Case: | Difficulty: Unknown
Blocking: | Blocked By:
| Related Tickets:
--------------------------------------------+------------------------------
Comment (by Simon Peyton Jones

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonpj): OK I have restored the old behaviour. (No need to merge.) I did a nofib run and found * No visible change in binary size * No change in allocation * Wibbles up and down in runtime but nothing consistent. (On my machine running `k-nucleotide` varies by up to 10% runtime in successive runes, for example.) So I don't think there are significant losses, and Johan definitely has a gain to get. I have not actually tried Johan's example, but perhaps Johan can? Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: closed Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: fixed | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Changes (by tibbe): * status: new => closed * resolution: => fixed Comment: My code now looks good. Thanks! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: closed Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: fixed | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonpj): Comment 9 you said "While the extra case is definitely a regression, it doesn't seem to be the cause of the time difference". and in comment 14 "without the extra case there is still a small difference". Is that still true? If so there might still be something to track down. Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: closed Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: fixed | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): Replying to [comment:21 simonpj]:
Comment 9 you said "While the extra case is definitely a regression, it doesn't seem to be the cause of the time difference". and in comment 14 "without the extra case there is still a small difference".
Is that still true? If so there might still be something to track down.
There's still a (quite small) difference in mutator time. I haven't had time to investigate it. The Cmm output by 7.6.3 (attachment:HashMapInsert-7.6.3.dump-opt-cmm) and HEAD (attachment:HashMapInsert-HEAD.dump-opt-cmm) look quite a bit different, at least superficially. For example, 7.6.3 has the core loop split into two functions, `$wpoly_go_info` and `s30b_ret`, while HEAD has just one function, `$wpoly_go_entry`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: closed Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: fixed | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): I've looked a bit more closely and one difference between 7.6.3 and HEAD is the extra stack spilling before the eval check I reported in #8905. Another difference is that the recursive call appears as {{{ jump $wpoly_go_info; // [R6, R5, R4, R3, R2] }}} in 7.6.3 but as {{{ call $wpoly_go_info(R6, R5, R4, R3, R2) returns to c56A, args: 8, res: 8, upd: 8; }}} in HEAD. I don't know if this is just a syntactic difference. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:23 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

one difference between 7.6.3 and HEAD is the extra stack spilling before
#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: closed Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: fixed | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonmar): the eval check I reported in #8905 I was surprised by this, because I would expect HEAD and 7.6.3 to generate very similar code with respect to spilling before a call. #8905 is about an improvement we can make in the new code generator, that wasn't possible in the old codegen. Looking at your dumps I see this for HEAD in `$wpoly_go_info`: {{{ c5k9: I64[Sp - 40] = PicBaseReg + block_c53R_info; R1 = R6; I64[Sp - 32] = R2; I64[Sp - 24] = R3; P64[Sp - 16] = R4; I64[Sp - 8] = R5; Sp = Sp - 40; if (R1 & 7 != 0) goto c53R; else goto c53S; }}} and this for 7.6.3: {{{ I64[Sp - 32] = R5; I64[Sp - 24] = R4; I64[Sp - 16] = R3; I64[Sp - 8] = R2; R1 = R6; I64[Sp - 40] = PicBaseReg + s30b_info; Sp = Sp - 40; if (R1 & 7 != 0) goto c3zM; }}} they look identical to me modulo reordering and things falling into different stack slots. Maybe the problematic bit is somewhere else - could you point to it? Some of the other differences you're seeing are due to the fact that the new codegen (with the NCG) doesn't break up functions at proc-points, so you see larger chunks of code where 7.6.3 broke things into smaller pieces. Most of the time this won't make any difference to the generated code, unless there's a join point (a let-no-escape) where HEAD should generate better code. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I was surprised by this, because I would expect HEAD and 7.6.3 to generate very similar code with respect to spilling before a call. #8905 is about an improvement we can make in the new code generator, that wasn't
#8900: Strictness analysis regression --------------------------------------------+------------------------------ Reporter: tibbe | Owner: Type: bug | Status: closed Priority: normal | Milestone: Component: Compiler | Version: 7.8.1-rc2 Resolution: fixed | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime performance bug | (amd64) Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): Replying to [comment:24 simonmar]: possible in the old codegen. My mistake. I've been reading too much Cmm lately.
they look identical to me modulo reordering and things falling into different stack slots. Maybe the problematic bit is somewhere else - could you point to it?
I have no idea then. There's a lot of Cmm and diff tools don't work well because blocks have been reordered a lot between 7.6.3 and HEAD. Since the difference is so small I suggest we ignore it for now. I'm more interested in getting #8905 into a production-ready state. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8900#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC