[GHC] #15544: Possible segmentation fault in cryptohash-sha256 testsuite

#15544: Possible segmentation fault in cryptohash-sha256 testsuite
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone: 8.6.1
Component: Compiler | Version: 8.4.3
Keywords: | Operating System: Unknown/Multiple
Architecture: Unknown/Multiple | Type of failure: None/Unknown
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Rev(s): | Wiki Page:
-------------------------------------+-------------------------------------
{{{
$ cabal get cryptohash-sha256-0.11.101.0
$ cabal new-run -w ghc-8.6.1 --enable-test --allow-newer=cryptohash-sha256:base,*:stm,*:tasty,async:base test:test-sha256 -- -j 8 --quickcheck-tests 9999
}}}

Eventually the program will start spamming stderr with `test-sha256: lost signal due to full pipe: 11` repeatedly.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#15544: Possible segmentation fault in cryptohash-sha256 testsuite

Changes (by bgamari):

 * priority: normal => highest

Comment:

Sometimes the program also outright segfaults. The stderr message appears to be due to the timer manager control fd filling. I suspect this thread is killed due to the segfault and consequently the pipe fills, resulting in the messages.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:1

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:2

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Description changed by bgamari:

Old description:

{{{
$ cabal get cryptohash-sha256-0.11.101.0
$ cabal new-run -w ghc-8.6.1 --enable-test --allow-newer=cryptohash-sha256:base,*:stm,*:tasty,async:base test:test-sha256 -- -j 8 --quickcheck-tests 9999
}}}

Eventually the program will start spamming stderr with `test-sha256: lost signal due to full pipe: 11` repeatedly.

New description:

{{{
$ cabal get cryptohash-sha256-0.11.101.0
$ cabal new-run -w ghc-8.6.1 --enable-test --allow-newer=cryptohash-sha256:base,*:stm,*:tasty,async:base test:test-sha256 -- -j 8 --quickcheck-tests 9999
}}}

Eventually the program will start spamming stderr with `test-sha256: lost signal due to full pipe: 11` repeatedly.

This apparently only started with 8.6.1.

--

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:3

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Old description:

{{{
$ cabal get cryptohash-sha256-0.11.101.0
$ cabal new-run -w ghc-8.6.1 --enable-test --allow-newer=cryptohash-sha256:base,*:stm,*:tasty,async:base test:test-sha256 -- -j 8 --quickcheck-tests 9999
}}}

Eventually the program will start spamming stderr with `test-sha256: lost signal due to full pipe: 11` repeatedly.

This apparently only started with 8.6.1.

New description:

{{{
$ cabal get cryptohash-sha256-0.11.101.0
$ cd cryptohash-sha256-0.11.101.0
$ cabal new-run -w ghc-8.6.1 --enable-test --allow-newer=*:base,*:stm,*:tasty test:test-sha256 -- -j 8 --quickcheck-tests 9999
}}}

Eventually the program will start spamming stderr with `test-sha256: lost signal due to full pipe: 11` repeatedly.

This apparently only started with 8.6.1.

--

Comment (by sjakobi):

Another variation I'm seeing when I add `--timeout 1s` is

{{{
XL-vec inc: test-sha256: internal error: evacuate: strange closure type 0
    (GHC version 8.6.0.20180810 for x86_64_unknown_linux)
}}}

Strangely I can't reproduce the bug when I try to run the apparently problematic testcase on its own with `--pattern KATs.XL-vec.inc`.

Also note that a similar issue has been reported for GHC-8.2.2 on 32-bit MIPS at https://github.com/haskell-hvr/cryptohash-sha256/issues/3.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:4

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by bgamari):

The fact that this happened on MIPS earlier is a very interesting data point. I'm not yet sure what to make of it; perhaps it's a missing memory barrier?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:5

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by bgamari):

So I finally had a chance to reproduce this with a compiler built with debugging symbols. Interestingly, the first segfault I saw in this environment was in the SHA implementation itself:

{{{
Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 16174.16189]
0x000000000062aaae in sha256_do_chunk ()
disassemble
Dump of assembler code for function sha256_do_chunk:
   0x000000000062a5d0 <+0>:     push   %r15
   0x000000000062a5d2 <+2>:     push   %r14
   0x000000000062a5d4 <+4>:     push   %r13
   0x000000000062a5d6 <+6>:     push   %r12
   0x000000000062a5d8 <+8>:     push   %rbp
   0x000000000062a5d9 <+9>:     push   %rbx
   0x000000000062a5da <+10>:    sub    $0x158,%rsp
   0x000000000062a5e1 <+17>:    mov    %fs:0x28,%rax
   ...
   0x000000000062aaa7 <+1239>:  pop    %r13
   0x000000000062aaa9 <+1241>:  pop    %r14
   0x000000000062aaab <+1243>:  pop    %r15
   0x000000000062aaad <+1245>:  retq
=> 0x000000000062aaae <+1246>:  movdqu (%rsi),%xmm0
   0x000000000062aab2 <+1250>:  lea    0x40(%r11),%rcx
   0x000000000062aab6 <+1254>:  mov    %r11,%rax
   0x000000000062aab9 <+1257>:  movaps %xmm0,0x40(%rsp)
   ...
bt
#0  0x000000000062aaae in sha256_do_chunk ()
#1  0x000000000062c05f in ghczuwrapperZC4ZCcryptohashzmsha256zm0zi11zi101zi0zminplaceZCCryptoziHashziSHA256ziFFIZChszucryptohashzusha256zuupdate ()
#2  0x00000000006278a5 in s7zn_info ()
#3  0x0000000000000000 in ?? ()
}}}

At first I suspected this was an alignment issue but no, `movdqu` is an unaligned move.

The value of `$rsi` is quite suspicious:

{{{
print /a $rsi
$1 = 0x510000004200b85a
}}}

In fact, it seems that the crash occurs essentially as soon as we enter `sha256_do_chunk`. Tracing execution back into Haskell, it looks like this crazy pointer comes from the C stack:

{{{
Dump of assembler code for function s7zn_info:
   ...
   0x0000000000627864 <+596>:   xor    %esi,%esi
   0x0000000000627866 <+598>:   mov    %rax,%r8
   0x0000000000627869 <+601>:   xor    %eax,%eax
   0x000000000062786b <+603>:   mov    %r8,%r14
   0x000000000062786e <+606>:   mov    %rdx,0x48(%rsp)      (B)
   0x0000000000627873 <+611>:   mov    %rcx,0x50(%rsp)
   0x0000000000627878 <+616>:   callq  0x7a2e00 <suspendThread>
   0x000000000062787d <+621>:   add    $0x8,%rsp
   0x0000000000627881 <+625>:   sub    $0x8,%rsp
=> 0x0000000000627885 <+629>:   mov    0x48(%rsp),%rcx      (A)
   0x000000000062788a <+634>:   mov    0x50(%rsp),%rdx
   0x000000000062788f <+639>:   add    %rdx,%rcx
   0x0000000000627892 <+642>:   mov    %rbx,%rdx
   0x0000000000627895 <+645>:   mov    %r14,%rdi
   0x0000000000627898 <+648>:   mov    %rcx,%rsi
   0x000000000062789b <+651>:   mov    %rax,%rbx
   0x000000000062789e <+654>:   xor    %eax,%eax
   0x00000000006278a0 <+656>:   callq  0x62c000 <ghczuwrapperZC4ZCcryptohashzmsha256zm0zi11zi101zi0zminplaceZCCryptoziHashziSHA256ziFFIZChszucryptohashzusha256zuupdate>
}}}

Where the stack at point (A) looks like this,

{{{
x/16a $rsp+0x38
0x7f3f82107e18: 0x0                     0x0
0x7f3f82107e28: 0xa800000042004d27      0xa900000000006b33     <==== yikes
0x7f3f82107e38: 0x42003e4301            0x42003e4310
0x7f3f82107e48: 0x42003e43a1            0x42003e4909
}}}

Tracing further back I end up at point (B),

{{{
Continuing.

Thread 2 hit Hardware watchpoint 1: *(void**) 0x7f3f82107e28

Old value = (void *) 0xa800000042004d27
New value = (void *) 0x42004e0d80
0x000000000062786e in s7zn_info ()
}}}

Continuing to trace things back, it seems that these pointers are loaded from a stack frame,

{{{
x/8a $rbx-1
0x4200088130:   0x6b0550                0x520000000000a02d
0x4200088140:   0xa800000042004d27      0xa900000000006b33
0x4200088150:   0x6a00000042004d26      0x797f38
0x4200088160:   0x4200088131            0x4200088118
}}}
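
A note on why the `$rsi` value above is fatal no matter how the load is encoded. Assuming the usual 48-bit x86-64 virtual address layout (an assumption; the ticket does not state the paging mode), bits 63..47 of a valid pointer must all be equal, and a load through any other pointer raises a general-protection fault that Linux delivers as SIGSEGV. The following C sketch checks that property; `is_canonical_48` and `check_ticket_values` are names made up for illustration and are not part of GHC or gdb:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is canonical under 48-bit x86-64 virtual addressing,
 * i.e. bits 63..47 are all copies of bit 47.
 * Assumption: 4-level paging; 5-level paging uses 57 bits instead. */
static bool is_canonical_48(uint64_t addr) {
    uint64_t top = addr >> 47;            /* the 17 bits 63..47 */
    return top == 0 || top == 0x1ffff;
}

/* Applies the check to values copied from the gdb session above. */
static void check_ticket_values(void) {
    assert(!is_canonical_48(UINT64_C(0x510000004200b85a))); /* $rsi        */
    assert(!is_canonical_48(UINT64_C(0xa800000042004d27))); /* stack word  */
    assert(is_canonical_48(UINT64_C(0x42004e0d80)));        /* watchpoint  */
    assert(is_canonical_48(UINT64_C(0x7f3f82107e28)));      /* stack slot  */
}
```

Under that assumption, both `$rsi` and the stack word marked "yikes" are rejected, while the other value reported by the watchpoint and the stack slot address itself pass, which is consistent with the pointer being corrupted rather than merely misaligned.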
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:6

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by bgamari):

The previous debugging session ended up looking a bit messy. Starting afresh, this time with `+RTS -DS`. Things are looking much clearer now. We get the following segfault:

{{{
Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000000007c8370 in stg_IND_STATIC_info () at rts/StgMiscClosures.cmm:270
270     {
bt
#0  0x00000000007c8370 in stg_IND_STATIC_info () at rts/StgMiscClosures.cmm:270
#1  0x00000000006ae1f0 in bytestringzm0zi10zi8zi2_DataziByteStringziLazzy_fromChunkszugo_info () at libraries/bytestring/Data/ByteString/Lazy.hs:267
#2  0x00000000007c5388 in ?? () at rts/Updates.cmm:31
#3  0x00000000006adc50 in bytestringzm0zi10zi8zi2_DataziByteStringziLazzy_toChunkszugo_info () at libraries/bytestring/Data/ByteString/Lazy.hs:271
#4  0x00000000007c5388 in ?? () at rts/Updates.cmm:31
#5  0x0000000000624d20 in s7zn_info ()
#6  0x0000000000000000 in ?? ()
}}}

where `$rbx` points to memory cleared by the RTS:

{{{
x/8a $rbx
0x4200365960:   0xaaaaaaaaaaaaaaaa      0xaaaaaaaaaaaaaaaa
0x4200365970:   0xaaaaaaaaaaaaaaaa      0xaaaaaaaaaaaaaaaa
0x4200365980:   0xaaaaaaaaaaaaaaaa      0xaaaaaaaaaaaaaaaa
0x4200365990:   0xaaaaaaaaaaaaaaaa      0xaaaaaaaaaaaaaaaa
}}}

Tracing this back it looks like this once resided in a nursery that has since been freed. It sounds like something is being freed too soon (perhaps something like #14346/#15260). Looking at the package implementation it looks like there are a few `withForeignPtr` uses. This is quite suspicious.
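
The 0xaa words in the dump above are what the RTS leaves behind in memory it has cleared when run with `-DS`, so a live pointer into such a region is dangling. As a rough illustration only (the helper names below are hypothetical and not part of the RTS), a check for that pattern could look like this in C:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fill word seen in the dump; treated here as the "cleared memory" marker. */
#define CLEARED_WORD UINT64_C(0xaaaaaaaaaaaaaaaa)

/* Count how many of the n words starting at p hold the fill pattern. */
static size_t count_cleared_words(const uint64_t *p, size_t n) {
    assert(p != NULL || n == 0);
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (p[i] == CLEARED_WORD) {
            hits++;
        }
    }
    return hits;
}

/* True if every word of a non-empty n-word region holds the fill pattern. */
static bool looks_cleared(const uint64_t *p, size_t n) {
    return n > 0 && count_cleared_words(p, n) == n;
}
```

Applied to the eight words shown at `$rbx`, `looks_cleared` would return true; a region holding even one ordinary heap pointer would not.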
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:7

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Description changed by osa1:

Old description:

{{{
$ cabal get cryptohash-sha256-0.11.101.0
$ cd cryptohash-sha256-0.11.101.0
$ cabal new-run -w ghc-8.6.1 --enable-test --allow-newer=*:base,*:stm,*:tasty test:test-sha256 -- -j 8 --quickcheck-tests 9999
}}}

Eventually the program will start spamming stderr with `test-sha256: lost signal due to full pipe: 11` repeatedly.

This apparently only started with 8.6.1.

New description:

{{{
$ cabal get cryptohash-sha256-0.11.101.0
$ cd cryptohash-sha256-0.11.101.0
$ cabal new-run -w ghc-8.6.1 --enable-test --allow-newer="*:base,*:stm,*:tasty" test:test-sha256 -- -j 8 --quickcheck-tests 9999
}}}

Eventually the program will start spamming stderr with `test-sha256: lost signal due to full pipe: 11` repeatedly.

This apparently only started with 8.6.1.

--

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:8

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Changes (by bgamari):

 * cc: simonmar (added)

Comment:

It is looking very much like this is due to the SRT rework. Specifically, I bisected this down to one of the following:

{{{
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
f2d27c1ad69321872a87a37144fe41e815301f5b
0c7db226012b5cfafc9a38bfe372661672ec8900
b701e4754d1d6286e233c73d7360926f3eaae577
4ffaf4b67773af4c72d92bb8b6c87b1a7d34ac0f
5f15d53a98ad2f26465730d8c3463ccc58f6d94a
99f8cc84a5b23878b3b0732955cb651bc973e9f2
f27e4f624fe1270e8027ff0a14f03514f5be31b7
126b4125d95f7e4d272a9307cb8b634b11bd337f
5d3b15ecbf17b7747c2f7313a981c60a2d22904d
3310f7f14c0ba34a57fe5a77f47d2a66fe838a43
819b9cfd21a1773091cec4e34716a0fd7c7d05c6
01bb17fd4dc6d92cf08632bbb62656428db6e7fa
797a46239d958841219f0f7769b0016b1b23d5ca
838b69032566ce6ab3918d70e8d5e098d0bcee02
efe405447b9fa88cebce718a6329091755deb9ad
2b0918c9834be1873728176e4944bec26271234a
2bbdd00c6d70bdc31ff78e2a42b26159c8717856
5a7c657e02b1e801c84f26ea383f326234cd993c
fbd28e2c6b5f1302cd2d36d79149e3b0a9f01d84
ae292c6d1362f34117be75a2553049cec7022244
eb8e692cab7970c495681e14721d05ecadd21581
d78dde9ff685830bc9d6bb24a158eb31bb8a7028
We cannot bisect more!
}}}

Unfortunately bisecting through these has been extremely messy due to non-building commits.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:9

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Changes (by alpmestan):

 * cc: alpmestan (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:10

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Changes (by osa1):

 * cc: osa1 (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:11

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by osa1):

I can't reproduce any failures in this program without using the debug runtime and enabling sanity checks with `-debug -rtsopts -with-rtsopts=-DS` (tried with `--timeout 1s` too). When I enable sanity checks I get the following every few runs (this is with GHC HEAD):

{{{
test-sha256: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 851
    (GHC version 8.7.20180827 for x86_64_unknown_linux)
    Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
zsh: abort (core dumped) cabal new-run --with-ghc=ghc-stage2 --enable-test --allow-newer -- -j 8 999
}}}

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:12

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by osa1):

I tried to reproduce this on another x86_64 system and failed. bgamari, are you reproducing this with only the command shown in the description?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:13

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by bgamari):

Yes, I can reliably reproduce this on both machines I've tried on.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:14

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by osa1):

I'm trying to debug this. I think it may be related to 7fc418df856d9b58034eeec48915646e67a7a562. Can someone who can reproduce the segfault try again with this commit reverted? I can't reproduce the segfault on my two x86_64 systems (even with `--timeout 1s`).

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:15

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by simonmar):

@osa1 what makes you suspect the STM fix? @bgamari's earlier debugging seemed to suggest that it was SRT-related, in particular if we're crashing in `stg_IND_STATIC` that usually indicates a CAF has been GC'd too early.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:16

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by osa1):

> @osa1 what makes you suspect the STM fix?

I'm debugging the assertion failure in comment:12 which looked serious enough to me (a TSO list is getting corrupted). I realized that the list that's being corrupted is a run queue, and the reason it's being corrupted is because in `stmCommitTransaction` we unpark a thread that is already in a run queue. So at some point the thread is in two lists (in both a run queue and a TRec's wait queue).

This is the point where we corrupt the list:

{{{
We're unpark_tso()'ing a thread that is already in a run queue.

352         if (tso->block_info.closure != &stg_STM_AWOKEN_closure) {
353             // safe to do a non-atomic test-and-set here, because it's
354             // fine if we do multiple tryWakeupThread()s.
355             tso->block_info.closure = &stg_STM_AWOKEN_closure;
356             tryWakeupThread(cap,tso);
357         }

Old value = (StgTSO *) 0x104df58
New value = (StgTSO *) 0x42001d9000
0x0000000000dcb2b3 in unpark_tso (cap=0x104f6c0 <MainCapability>, tso=0x42001d9078) at rts/STM.c:355
355             tso->block_info.closure = &stg_STM_AWOKEN_closure;
bt
#0  0x0000000000dcb2b3 in unpark_tso (cap=0x104f6c0 <MainCapability>, tso=0x42001d9078) at rts/STM.c:355
#1  0x0000000000dcb35c in unpark_waiters_on (cap=0x104f6c0 <MainCapability>, s=0x42001c2070) at rts/STM.c:374
#2  0x0000000000dcd2d2 in stmCommitTransaction (cap=0x104f6c0 <MainCapability>, trec=0x4200037c50) at rts/STM.c:1092
#3  0x0000000000dee080 in stg_atomically_frame_info ()
#4  0x0000000000000000 in ?? ()
}}}

(note that this is reverse execution so "Old value" is actually the new value)

The thread is already in a run queue:

{{{
print tso
$23 = (StgTSO *) 0x42001d9078
print MainCapability->run_queue_hd->_link->_link
$25 = (struct StgTSO_ *) 0x42001d9078
}}}

At this point the TSO link is fine:

{{{
print MainCapability->run_queue_hd->_link->_link->block_info.prev == MainCapability->run_queue_hd->_link
$29 = 1
}}}

Because the STM fix changed `unpark_tso()` I thought it may be related. I don't yet know how this thread ends up in two lists, I'll investigate further.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:17

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by simonmar):

Aha, yes I think you've found a problem. The run queue is doubly-linked, and the `block_info` field of the TSO is used as the back link, but we're overwriting that pointer in `unpark_tso`. I'd forgotten about the double use of `block_info` when I wrote that patch.

I don't know if this causes the original problem, there might still be a SRT problem, but this queue corruption is definitely a bug.

I guess we shouldn't touch the `block_info` field here. Would you like to make a patch? Unfortunately that reintroduces the problem of how to avoid repeated wakeup messages.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:18
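
To make the double use of `block_info` concrete, here is a small C model of the situation described in the comment above. It is a sketch only: `Tso`, `Closure`, `enqueue`, `mark_awoken`, `queue_is_consistent` and `stm_awoken_marker` are simplified, hypothetical stand-ins rather than the RTS definitions. The only details taken from the discussion are that the run queue's back link (`block_info.prev`) and the blocking closure (`block_info.closure`) share one field, and that the "awoken" marker is written through that field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an RTS closure. */
typedef struct Closure { int unused; } Closure;

/* Hypothetical, heavily simplified thread object. */
typedef struct Tso {
    struct Tso *link;            /* forward link in the run queue          */
    union {
        struct Tso *prev;        /* back link while on the run queue       */
        Closure    *closure;     /* what the thread is blocked on          */
    } block_info;                /* one field, two meanings                */
} Tso;

/* Stand-in for the marker closure written by the wakeup path. */
static Closure stm_awoken_marker;

/* Append t to the doubly-linked queue described by *hd and *tl. */
static void enqueue(Tso **hd, Tso **tl, Tso *t) {
    assert(hd != NULL && tl != NULL && t != NULL);
    t->link = NULL;
    t->block_info.prev = *tl;
    if (*tl != NULL) {
        (*tl)->link = t;
    } else {
        *hd = t;
    }
    *tl = t;
}

/* Sanity-style check: each back link must point at the predecessor. */
static bool queue_is_consistent(const Tso *hd) {
    const Tso *prev = NULL;
    for (const Tso *t = hd; t != NULL; t = t->link) {
        if (t->block_info.prev != prev) {
            return false;
        }
        prev = t;
    }
    return true;
}

/* Models the marking step: it overwrites the shared field, which is
 * harmless for a blocked thread but clobbers the back link of a thread
 * that is still on a run queue. */
static void mark_awoken(Tso *t) {
    t->block_info.closure = &stm_awoken_marker;
}
```

In this model, enqueueing three threads and then calling `mark_awoken` on the middle one makes `queue_is_consistent` return false, which is the kind of corruption the sanity check from comment:12 appears to be catching.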

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by osa1):

Dumb question: can we not remove unparked TSOs from the wait list in `unpark_waiters_on()`?

> Would you like to make a patch?

What do you want the patch to do? Do you want to unconditionally try to wake up a thread (sometimes sending multiple wakeup messages)?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:19

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite

Comment (by osa1):

I managed to reproduce it and did some debugging.

Here's the problem. We have this object:

{{{
print *((StgClosure *) 0xe4c558)
$21 = {
  header = {
info = 0x409968 , Unf=OtherCon []] =
    sat-only [] \r [ww_seSe]
        case ww_seSe of ds1_seSf [Occ=Once] {
          __DEFAULT ->
              let {
                sat_seSk [Occ=Once] :: [Data.ByteString.Internal.ByteString]
                [LclId] =
                    [ds1_seSf] \u []
                        case -# [ds1_seSf 1#] of sat_seSg [Occ=Once] {
                          __DEFAULT ->
                              case $wxs_reFi sat_seSg of {
                                (#,#) ww2_seSi [Occ=Once] ww3_seSj [Occ=Once] ->
                                    : [ww2_seSi ww3_seSj];
                              };
                        };
              } in  (#,#) [x_reFh sat_seSk];
          1# -> (#,#) [x_reFh GHC.Types.[]];
        };
}}}
Notice that (1) it's a FUN_STATIC (2) it has references to another static
object x_reFh:
{{{
x_reFh :: Data.ByteString.Internal.ByteString
[GblId] =
    [] \u []
        case
            newMutVar# [GHC.ForeignPtr.NoFinalizers GHC.Prim.realWorld#]
        of
        { (#,#) ipv_seS6 [Occ=Once] ipv1_seS7 [Occ=Once] ->
              case __pkg_ccall bytestring-0.10.8.2 [addr#1_reFg ipv_seS6] of {
                (#,#) _ [Occ=Dead] ds2_seSb [Occ=Once] ->
                    case word2Int# [ds2_seSb] of sat_seSd [Occ=Once] {
                      __DEFAULT ->
                          let {
                            sat_seSc [Occ=Once] :: GHC.ForeignPtr.ForeignPtrContents
                            [LclId] =
                                CCCS GHC.ForeignPtr.PlainForeignPtr! [ipv1_seS7];
                          } in
                            Data.ByteString.Internal.PS [addr#1_reFg sat_seSc 0# sat_seSd];
                    };
              };
        };
}}}
The FUN_STATIC SRT optimization should apply to this object. So instead of
a SRT table we should have the SRT entries in its payload. However n_ptrs
of this object is 0:
{{{
set $itbl = itbl_to_fun_itbl(get_itbl((StgClosure *) 0xe4c558))
print *$itbl
$21 = {
  f = {
    slow_apply_offset = 59278791,
    __pad_slow_apply_offset = 1572864,
    b = {
      bitmap = 10376465356425854976,
      bitmap_offset = -907476992,
      __pad_bitmap_offset = 3387490304
    },
    fun_type = 4,
    arity = 1
  },
  i = {
    layout = {
      payload = {
        ptrs = 0,
        nptrs = 0
      },
      bitmap = 0,
      large_bitmap_offset = 0,
      __pad_large_bitmap_offset = 0,
      selector_offset = 0
    },
    type = 14,
    srt = 10759120,
    code = 0x409968 "I\203\304\030M;\245X\003"
  }
}
}}}

So it seems like for some reason we don't actually do the FUN_STATIC SRT optimization for this object. Indeed I can get the reference to reFh in the srt field:

{{{
print *((StgClosure*) (((StgWord) (($itbl)+1)) + ($itbl)->i.srt))      <--- GET_FUN_SRT
$10 = {
  header = {
    info = 0x4097e8
print ((StgClosure*) (((StgWord) (($itbl)+1)) + ($itbl)->i.srt))
$11 = (StgClosure *) 0xe4c538
}}}
x_reFh is originally a THUNK and becomes IND_STATIC after evaluation:

{{{
call printClosure((StgClosure *) 0xe4c538)
THUNK(0x4097e8)
c
Hardware watchpoint 5: ((StgClosure *) 0xe4c538)->header.info

Old value = (const StgInfoTable *) 0x4097e8
bt
#0  SET_INFO (c=0xe4c538, info=0xdce688 ) at includes/rts/storage/ClosureMacros.h:50
#1  0x0000000000dbac9b in lockCAF (reg=0x1020818 , caf=0xe4c538) at rts/sm/Storage.c:415
#2  0x0000000000dbacc5 in newCAF (reg=0x1020818 , caf=0xe4c538) at rts/sm/Storage.c:425
#3  0x0000000000409809 in reFh_info ()
#4  0x0000000000000000 in ?? ()
call printClosure((StgClosure *) 0xe4c538)
IND_STATIC(0x42004d5878)
}}}

Now as long as reFi is reachable this 0xe4c538 should be reachable because it's in the SRT of reFi. Let's continue:

{{{
c
... assertion failure ...
bt
#0  0x0000000000db8800 in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:260
#1  0x0000000000db884f in LOOKS_LIKE_INFO_PTR (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:265
#2  0x0000000000db8887 in LOOKS_LIKE_CLOSURE_PTR (p=0x4200122a10) at includes/rts/storage/ClosureMacros.h:270
#3  0x0000000000db9240 in evacuate (p=0xe4c540) at rts/sm/Evac.c:516
#4  0x0000000000ddf87e in scavenge_static () at rts/sm/Scav.c:1690
#5  0x0000000000ddff0a in scavenge_loop () at rts/sm/Scav.c:2085
#6  0x0000000000db4c49 in scavenge_until_all_done () at rts/sm/GC.c:1088
#7  0x0000000000db38ba in GarbageCollect (collect_gen=1, do_heap_census=false, gc_type=0, cap=0x1020800 <MainCapability>, idle_cap=0x0) at rts/sm/GC.c:416
#8  0x0000000000d995a7 in scheduleDoGC (pcap=0x7fff635d6780, task=0x2802f60, force_major=false) at rts/Schedule.c:1799
#9  0x0000000000d98a7f in schedule (initialCapability=0x1020800 <MainCapability>, task=0x2802f60) at rts/Schedule.c:545
#10 0x0000000000d99f79 in scheduleWaitThread (tso=0x4200105388, ret=0x0, pcap=0x7fff635d6880) at rts/Schedule.c:2533
#11 0x0000000000da8b4c in rts_evalLazyIO (cap=0x7fff635d6880, p=0xe4d928, ret=0x0) at rts/RtsAPI.c:530
#12 0x0000000000da9297 in hs_main (argc=7, argv=0x7fff635d6a78, main_closure=0xe4d928, rts_config=...) at rts/RtsMain.c:72
#13 0x000000000041210c in main ()
}}}

0xe4c540 is the indirectee of 0xe4c538:

{{{
print &((StgInd*)0xe4c538)->indirectee
$27 = (StgClosure **) 0xe4c540
}}}

But the object was cleared (because this is in sanity mode)

{{{
print *UNTAG_CLOSURE(((StgInd*)0xe4c538)->indirectee)
$29 = {
  header = {
    info = 0xaaaaaaaaaaaaaaaa
  },
  payload = 0x4200122a18
}
}}}

so it became unreachable. For this object to be unreachable reFi should be unreachable too. Let's see if it was reachable in this GC:

{{{
break GarbageCollect
Breakpoint 6 at 0xdb3492: file rts/sm/GC.c, line 226.
break evacuate_static_object if q == 0xe4c558
Breakpoint 7 at 0xdb8f85: file rts/sm/Evac.c, line 333.
reverse-continue
}}}

Breakpoint 7 is hit first, so it seems like reFi is actually reachable. We should be scavenging it too:

{{{
break Scav.c:1675 if p == 0xe4c558
Breakpoint 8 at 0xddf7cc: file rts/sm/Scav.c, line 1675.
c
bt
#0  scavenge_static () at rts/sm/Scav.c:1675
#1  0x0000000000ddff0a in scavenge_loop () at rts/sm/Scav.c:2085
#2  0x0000000000db4c49 in scavenge_until_all_done () at rts/sm/GC.c:1088
#3  0x0000000000db38ba in GarbageCollect (collect_gen=1, do_heap_census=false, gc_type=0, cap=0x1020800 <MainCapability>, idle_cap=0x0) at rts/sm/GC.c:416
#4  0x0000000000d995a7 in scheduleDoGC (pcap=0x7fff635d6780, task=0x2802f60, force_major=false) at rts/Schedule.c:1799
#5  0x0000000000d98a7f in schedule (initialCapability=0x1020800 <MainCapability>, task=0x2802f60) at rts/Schedule.c:545
#6  0x0000000000d99f79 in scheduleWaitThread (tso=0x4200105388, ret=0x0, pcap=0x7fff635d6880) at rts/Schedule.c:2533
#7  0x0000000000da8b4c in rts_evalLazyIO (cap=0x7fff635d6880, p=0xe4d928, ret=0x0) at rts/RtsAPI.c:530
#8  0x0000000000da9297 in hs_main (argc=7, argv=0x7fff635d6a78, main_closure=0xe4d928, rts_config=...) at rts/RtsMain.c:72
#9  0x000000000041210c in main ()
}}}
At this point if I step a few more lines I get the original assertion error.
So in summary: a FUN_STATIC is reachable, but somehow a static object in its
SRT is collected. Alternatively, it could be that the FUN_STATIC becomes
unreachable, and somehow becomes reachable again later.
Simon, I'm looking at the implementation of SRT optimization for FUN_STATIC.
I don't understand why we look at both the SRT field and nptrs of FUN_STATICs
in this code (in evacuate()):
{{{
case FUN_STATIC:
    if (info->srt != 0 || info->layout.payload.ptrs != 0) {
        evacuate_static_object(STATIC_LINK(info,(StgClosure *)q), q);
    }
    return;
}}}
As far as I understand, for FUN_STATICs we should only look at the payload,
no? I think that's what the note in CmmBuildInfoTables.hs says.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:20
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Compiler | Version: 8.4.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): Sigh that's because the debug RTS is broken. bgamari could we cherry-pick e431d75f8350f25159f9aaa49fe9a504e94bc0a4 to 8.6 branch? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Compiler | Version: 8.4.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): I just realized that the debug RTS issue mentioned in comment:21 does not actually make comment:20 invalid, so I'm still trying to understand the problem described in comment:20. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: new
Priority: highest | Milestone: 8.6.1
Component: Compiler | Version: 8.4.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by osa1):
- This problem is reproducible even without threaded runtime (should make it
easier to debug)
- Fixing the STM issue mentioned in comment:17 does not fix this issue (I
submitted Phab:D5144 for that)
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:23
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Compiler | Version: 8.4.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D5145 Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonmar): * differential: => Phab:D5145 Comment: See Phab:D5145 for a fix (I hope) and commentary on what the problem was. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: patch Priority: highest | Milestone: 8.6.1 Component: Compiler | Version: 8.4.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D5145 Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * status: new => patch -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: patch
Priority: highest | Milestone: 8.6.1
Component: Compiler | Version: 8.4.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D5145
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by osa1):
The patch (when applied to 8.6 branch with STM fix, after a distclean) seems
to fix the segfault, but I'm still getting
{{{
test-sha256: too many pending signals
}}}
I see that this error was not reported before but it's something I was
getting while trying to reproduce the segfault. Maybe there are multiple
bugs?
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:26
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: patch
Priority: highest | Milestone: 8.6.1
Component: Compiler | Version: 8.4.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D5145
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by simonmar):
I think the error you're seeing is the non-threaded equivalent of
`test-sha256: lost signal due to full pipe: 11`. Try turning `-threaded`
back on to see what the signal number is.
Something in this test installs signal handlers for a lot of different
signals (maybe all of them?). This is why we were getting
`test-sha256: lost signal due to full pipe: 11` - the program was throwing
`SIGSEGV` (signal 11), the handler runs and writes it to a pipe, and the
pipe fills up because `SIGSEGV` is thrown repeatedly. When working on the
test I disabled the catching of these signals in the RTS, but I don't know
where in the test it does this, maybe Tasty?
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:27
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: patch Priority: highest | Milestone: 8.6.1 Component: Compiler | Version: 8.4.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D5145 Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): Interesting, maybe the segfault is still happening then, because I'm getting `test-sha256: lost signal due to full pipe: 11` with threaded runtime. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: patch
Priority: highest | Milestone: 8.6.1
Component: Compiler | Version: 8.4.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D5145
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by osa1):
I also managed to get a panic in the GC (non-debug runtime):
{{{
Thread 5 (Thread 0x7fffeffff700 (LWP 26724)):
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff69cc801 in __GI_abort () at abort.c:79
#2 0x0000000000d9a175 in rtsFatalInternalErrorFn (s=0xe02de8 "evacuate: strange closure type %d", ap=0x7fffefffeaf8) at rts/RtsMessages.c:186
#3 0x0000000000d9a2bd in barf (s=s@entry=0xe02de8 "evacuate: strange closure type %d") at rts/RtsMessages.c:48
#4 0x000000000040a259 in evacuate1 (p=p@entry=0xe2c500) at rts/sm/Evac.c:862
#5 0x0000000000dc84f8 in scavenge_static () at rts/sm/Scav.c:1690
#6 scavenge_loop1 () at rts/sm/Scav.c:2085
#7 0x0000000000dad542 in scavenge_until_all_done () at rts/sm/GC.c:1085
#8 0x0000000000dade65 in GarbageCollect (collect_gen=collect_gen@entry=1, do_heap_census=do_heap_census@entry=false, gc_type=gc_type@entry=2, cap=0x1000240 <MainCapability>, cap@entry=0x1019f10, idle_cap=idle_cap@entry=0x7fffe4000d80) at rts/sm/GC.c:416
#9 0x0000000000d9c012 in scheduleDoGC (pcap=pcap@entry=0x7fffefffee70, task=task@entry=0x7fffe8000b70, force_major=force_major@entry=false) at rts/Schedule.c:1797
#10 0x0000000000d9c9ea in schedule (initialCapability=initialCapability@entry=0x1000240 <MainCapability>, task=task@entry=0x7fffe8000b70) at rts/Schedule.c:545
#11 0x0000000000d9df4c in scheduleWorker (cap=cap@entry=0x1000240 <MainCapability>, task=task@entry=0x7fffe8000b70) at rts/Schedule.c:2550
#12 0x0000000000da4e07 in workerStart (task=0x7fffe8000b70) at rts/Task.c:444
#13 0x00007ffff72106db in start_thread (arg=0x7fffeffff700) at pthread_create.c:463
#14 0x00007ffff6aad88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
}}}
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:29
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: patch
Priority: highest | Milestone: 8.6.1
Component: Compiler | Version: 8.4.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D5145
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by osa1):
When I disable signal handling in tasty I can see the segfault in gdb:
{{{
Thread 2 (Thread 27849.27861):
#0 0x000000420032dab1 in ?? ()
#1 0x0000000000000000 in ?? ()
}}}
GHC stack:
{{{
python import ghc_gdb
ghc backtrace
Sp = 0x42001ca2a0
0: RET_SMALL return=0x90bec0
field 0: Ptr 0x42000350a0 : THUNK_0_1
1: UPDATE_FRAME(0x4200035000: THUNK_1_0)
2: RET_SMALL return=0x8f9190
3: UPDATE_FRAME(0x4200035048: THUNK_1_0)
4: RET_SMALL return=0x689d30
field 0: Word 283469849024
5: RET_SMALL return=0xaa5f50
field 0: Ptr 0x42001ea1b0 : ARR_WORDS
6: RET_SMALL return=0x689ee0
7: RET_SMALL return=0x409a20
8: UPDATE_FRAME(0x42001ca430: BLACKHOLE)
9: RET_SMALL return=0x40b270
}}}
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:30
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: patch
Priority: highest | Milestone: 8.6.1
Component: Compiler | Version: 8.4.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D5145
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by simonmar):
For future archaeology, here are the details of what caused the crash.
The fix has comments to explain the underlying problem and the fix, but
here I want to put the full details of how it manifested in this
particular instance, just in case we need to revisit.
The program evaluates a CAF after it has been GC'd.
The best way to diagnose it is to add `-debug` to the `ghc-options` in the
`.cabal` file, and make sure that you have Phab:D4963 merged (this wasn't
merged in 8.6 at the time, which meant the assertion for GC'd CAFs didn't
fire as it should have).
You can also comment out a bunch of the code in the test case to make it
fail faster and with less code; see Phab:P183.
Now, the CAF in question is this:
{{{
x_rbHt :: Data.ByteString.Internal.ByteString
[GblId] =
[] \u []
case
newMutVar# [GHC.ForeignPtr.NoFinalizers GHC.Prim.realWorld#]
of
{ (#,#) ipv_sbMX [Occ=Once] ipv1_sbMY [Occ=Once] ->
case __pkg_ccall bytestring-0.10.8.2 [addr#1_rbHs ipv_sbMX]
of {
(#,#) _ [Occ=Dead] ds2_sbN2 [Occ=Once] ->
case word2Int# [ds2_sbN2] of sat_sbN4 [Occ=Once] {
__DEFAULT ->
let {
sat_sbN3 [Occ=Once] ::
GHC.ForeignPtr.ForeignPtrContents
[LclId] =
CCCS GHC.ForeignPtr.PlainForeignPtr!
[ipv1_sbMY];
} in
Data.ByteString.Internal.PS [addr#1_rbHs
sat_sbN3 0# sat_sbN4];
};
};
};
}}}
which is referred to by this function:
{{{
$wxs_rbHu
:: GHC.Prim.Int#
-> (# Data.ByteString.Internal.ByteString,
[Data.ByteString.Internal.ByteString] #)
[GblId, Arity=1, Str=, Unf=OtherCon []] =
sat-only [] \r [ww_sbN5]
case ww_sbN5 of ds1_sbN6 [Occ=Once] {
__DEFAULT ->
let {
sat_sbNb [Occ=Once] ::
[Data.ByteString.Internal.ByteString]
[LclId] =
[ds1_sbN6] \u []
case -# [ds1_sbN6 1#] of sat_sbN7 [Occ=Once] {
__DEFAULT ->
case $wxs_rbHu sat_sbN7 of {
(#,#) ww2_sbN9 [Occ=Once] ww3_sbNa
[Occ=Once] ->
: [ww2_sbN9 ww3_sbNa];
};
};
} in (#,#) [x_rbHt sat_sbNb];
1# -> (#,#) [x_rbHt GHC.Types.[]];
};
}}}
Note that
* the function refers to the CAF
* it is recursive, and
* the recursive call is inside a thunk (`sat_sbNb`)
We generated the following SRTs (use `-ddump-cmm` to see this):
{{{
[sat_sbNb_entry() // [R1]
{ info_tbls: [(cc8F,
label: sat_sbNb_info
rep: HeapRep 1 nonptrs { Thunk }
srt: Just x_rbHt_closure),
(cc8H,
label: block_cc8H_info
rep: StackRep []
srt: Nothing)]
$wxs_rbHu_entry() // [R2]
{ info_tbls: [(cc8S,
label: $wxs_rbHu_info
rep: HeapRep static { Fun {arity: 1 fun_type:
ArgSpec 4} }
srt: Just x_rbHt_closure)]
}}}
i.e. both the function and the thunk have singleton SRTs, pointing directly
to the CAF. This happens because these two declarations are in a cycle, and
the SRT pass assigns all declarations in a cycle the same SRT. The SRT
contains all the references from the RHSs of the declarations, which would
be `{$wxs_rbHu_closure, x_rbHt_closure}` except that we remove the
recursive reference to `$wxs_rbHu_closure` from the set (it's not
necessary to have recursive references in the SRT, the SRT only needs to
point to all the things that can be reached from this group).
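The set computation described above can be sketched roughly as follows. This is only a toy model in C, not the code in CmmBuildInfoTables.hs: the names `contains` and `group_srt`, the plain label strings, and the caller-supplied output array are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: is `label` among the first `n` entries of `set`? */
static bool contains(const char *const *set, size_t n, const char *label)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(set[i], label) == 0) {
            return true;
        }
    }
    return false;
}

/*
 * Toy model of the SRT shared by a recursive group, following the
 * behaviour described in this comment (i.e. before the fix): collect
 * every static closure referenced from the group's right-hand sides,
 * then drop references to members of the group itself.
 * `out` must have room for `nrefs` entries; returns the SRT size.
 */
static size_t group_srt(const char *const *refs, size_t nrefs,
                        const char *const *group, size_t ngroup,
                        const char **out)
{
    assert(refs != NULL && group != NULL && out != NULL);
    size_t nout = 0;
    for (size_t i = 0; i < nrefs; i++) {
        if (contains(group, ngroup, refs[i])) {
            continue;   /* recursive reference: omitted from the SRT */
        }
        if (contains(out, nout, refs[i])) {
            continue;   /* already recorded */
        }
        out[nout++] = refs[i];
    }
    return nout;
}
```

Calling `group_srt` with the references `{"$wxs_rbHu_closure", "x_rbHt_closure"}` and a group containing `"$wxs_rbHu_closure"` leaves only `x_rbHt_closure`, which is the singleton SRT that both info tables show in the dump.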
The crash occurred as follows. Let's call the thunk `sat_sbNb_entry` "A",
and the function `$wxs_rbHu_entry` "B".
* suppose we GC when A is alive, and B is not otherwise reachable.
* B is "collected", meaning that it doesn't make it onto the static
objects list during this GC, but nothing bad happens yet.
* Next, suppose we enter A, and then call B. (remember that A refers to B)
At the entry point to B, we GC. This puts B on the stack, as part of the
RET_FUN stack frame that gets pushed when we GC at a function entry point.
* This GC will now reach B
* But because B was previously "collected", it breaks the assumption that
static objects are never resurrected. See `Note [STATIC_LINK fields]` in
rts/sm/Storage.h for why this is bad.
* In practice, the GC thinks that B has already been visited, and so
doesn't visit X (the CAF `x_rbHt`), and catastrophe ensues.
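The sequence above can be replayed in a small model. The sketch below is a deliberate simplification and not the RTS code: it assumes each static object carries a single "last visited" marker, that the marker alternates between two values on every major GC, and that B had been visited in some earlier major GC before it was skipped. All names (`StaticObj`, `visit_static`, `major_gc`, `caf_survives`) are hypothetical; X stands for the CAF `x_rbHt`. The details should be checked against `Note [STATIC_LINK fields]`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Two alternating marker values; 0 means "never visited by any GC". */
enum { FLAG_A = 1, FLAG_B = 2 };

/* Toy static object: at most one SRT entry and a "last visited" marker. */
typedef struct StaticObj {
    struct StaticObj *srt;  /* object named by this object's SRT, or NULL */
    int link_flag;          /* marker of the last major GC that visited it */
} StaticObj;

static int static_flag = FLAG_A;    /* marker used by the current major GC */

/* Visit a static object unless its marker says it was already visited. */
static void visit_static(StaticObj *obj)
{
    if (obj == NULL || obj->link_flag == static_flag) {
        return;                     /* looks "already visited" in this GC */
    }
    obj->link_flag = static_flag;
    visit_static(obj->srt);
}

/* Run one major GC from `root`; report whether `target` was visited. */
static bool major_gc(StaticObj *root, const StaticObj *target)
{
    assert(root != NULL && target != NULL);
    visit_static(root);
    bool visited = (target->link_flag == static_flag);
    static_flag = (static_flag == FLAG_A) ? FLAG_B : FLAG_A;
    return visited;
}

/*
 * Replay three major GCs. `thunk_srt_is_function` selects what the SRT of
 * the thunk A points to: the function B (true) or the CAF X (false).
 * Returns whether X was still visited by the third GC.
 */
static bool caf_survives(bool thunk_srt_is_function)
{
    StaticObj x = { NULL, 0 };      /* the CAF X */
    StaticObj b = { &x, 0 };        /* the function B; its SRT points to X */
    StaticObj *via_a = thunk_srt_is_function ? &b : &x;

    static_flag = FLAG_A;
    (void)major_gc(&b, &x);         /* GC 1: B is reachable and visited */
    (void)major_gc(via_a, &x);      /* GC 2: only the thunk A is alive */
    return major_gc(&b, &x);        /* GC 3: A was entered, B is on the stack */
}
```

In this model `caf_survives(false)` reports that X is missed by the third GC, because B still carries the marker from the first GC and is skipped, while `caf_survives(true)` keeps X alive because B is visited in every GC in which A is alive.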
The breakage is caused by a combination of two things:
1. the SRT for the thunk A doesn't point to the function B, even though it
calls the function.
2. the function's entry code causes a pointer to the function's closure to
appear on the stack, when it wasn't previously visible to the GC.
We opted to fix (1), because it's not clear whether (2) could happen in
other ways.
It turned out that (1) could happen in two ways:
* a "shortcutting" optimisation in SRT generation
* omitting recursive references from the SRT of a recursive group
For completeness, here is what we want to generate instead:
{{{
[sat_sbNb_entry() // [R1]
{ info_tbls: [(cc8F,
label: sat_sbNb_info
rep: HeapRep 1 nonptrs { Thunk }
srt: Just $wxs_rbHu_closure), <--- SRT points to
the function, not the CAF
(cc8H,
label: block_cc8H_info
rep: StackRep []
srt: Nothing)]
stack_info: arg_space: 8 updfr_space: Just 8
}
}}}
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:31
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: patch
Priority: highest | Milestone: 8.6.1
Component: Compiler | Version: 8.4.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D5145
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: merge Priority: highest | Milestone: 8.6.1 Component: Compiler | Version: 8.4.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D5145 Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * status: patch => merge -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:33 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15544: Non-deterministic segmentation fault in cryptohash-sha256 testsuite -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: closed Priority: highest | Milestone: 8.6.1 Component: Compiler | Version: 8.4.3 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D5145 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: merge => closed * resolution: => fixed Comment: Merged with 28356f217fe4d314bd5a4f0316b5bced755cbb2f. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15544#comment:34 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler