[GHC] #13970: Segmentation fault inside threadPaused

#13970: Segmentation fault inside threadPaused
-------------------------------------+-------------------------------------
Reporter: albertov | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone:
Component: Runtime System | Version: 8.2.1-rc3
Keywords: | Operating System: Unknown/Multiple
Architecture: Unknown/Multiple | Type of failure: None/Unknown
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Rev(s): | Wiki Page:
-------------------------------------+-------------------------------------
A multithreaded program generated by the latest release candidate occasionally
segfaults inside the runtime system. It is always at the same instruction:
{{{
(gdb) bt
#0 0x00007f25ca77fde3 in threadPaused ()
from /nix/store/995xifyvjlbvd138r0gpq008nyxls6hr-ghc-8.2.0.20170704/lib/ghc-8.2.0.20170704/rts/libHSrts_thr-ghc8.2.0.20170704.so
#1 0x00007f25ca795068 in stg_returnToSched ()
from /nix/store/995xifyvjlbvd138r0gpq008nyxls6hr-ghc-8.2.0.20170704/lib/ghc-8.2.0.20170704/rts/libHSrts_thr-ghc8.2.0.20170704.so
#2 0x0000000000000000 in ?? ()
(gdb) disassemble
Dump of assembler code for function threadPaused:
0x00007f25ca77fda0 <+0>: push %r15
0x00007f25ca77fda2 <+2>: push %r14
0x00007f25ca77fda4 <+4>: push %r13
0x00007f25ca77fda6 <+6>: push %r12
0x00007f25ca77fda8 <+8>: mov %rdi,%r12
0x00007f25ca77fdab <+11>: push %rbp
0x00007f25ca77fdac <+12>: push %rbx
0x00007f25ca77fdad <+13>: mov %rsi,%rbp
0x00007f25ca77fdb0 <+16>: sub $0x28,%rsp
0x00007f25ca77fdb4 <+20>: callq 0x7f25ca77a640 <maybePerformBlockedException>
0x00007f25ca77fdb9 <+25>: cmpw $0x3,0x20(%rbp)
0x00007f25ca77fdbe <+30>: je 0x7f25ca77fe1d

#13970: Segmentation fault inside threadPaused
Comment (by bgamari):
I don't suppose you could provide a reproducer for this?
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:1
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#13970: Segmentation fault inside threadPaused
Comment (by bgamari):
What might also help is if you could compile your program with `-debug` and paste the output of `x/64a tso->stackobj->sp` after the program crashes. It seems like we are getting confused walking the stack, so it would be interesting to know what the stack looks like.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:2

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Replying to [comment:2 bgamari]:
What might also help is if you could compile your program with `-debug` and paste the output of `x/64a tso->stackobj->sp` after the program crashes. It seems like we are getting confused walking the stack, so it would be interesting to know what the stack looks like.
I did, but I still had no symbols in gdb. I must be stripping them somewhere in my build... I will take a look and report back.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:3

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Replying to [comment:1 bgamari]:
I don't suppose you could provide a reproducer for this?
It's hard for me to provide a reproducible example since the crash occurs in a complex proprietary program, and I have no idea what could be causing it, other than that increasing the number of capabilities with -N tends to increase the frequency of the crashes. I might be able to convince my employer to open-source the "engine" (a 2D spread simulation, which we use to simulate wildfires). I will certainly try to use this ticket as an argument in favor of doing that.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:4

#13970: Segmentation fault inside threadPaused
Comment (by bgamari):
Note that I believe `Cabal` strips by default.
It's hard for me to provide a reproducible example since the crash occurs in a complex proprietary program, and I have no idea what could be causing it, other than that increasing the number of capabilities with -N tends to increase the frequency of the crashes.
Quite understandable. Anything you can offer would be very much appreciated. I tested this patch with the testcase from #13615 with over a day of runtime without seeing this crash, so it seems that you are tickling something odd.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:5

#13970: Segmentation fault inside threadPaused
Comment (by bgamari):
If you aren't able to get debug symbols, I believe that the `StgTSO*` argument can be found at `*$rbp`. Consequently, `x/64a ((uint64_t*) $rbp)[0] + 0x10` will likely do the trick.
For my future reference: `$rbx` contains `frame`, `$r13` contains `stack_end`.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:6

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Replying to [comment:2 bgamari]:
What might also help is if you could compile your program with `-debug` and paste the output of `x/64a tso->stackobj->sp` after the program crashes. It seems like we are getting confused walking the stack, so it would be interesting to know what the stack looks like.
I finally managed to get debugging symbols. I had to pass
{{{dontStrip = true}}} to the Nix derivation that builds ghc from git.
Anyway, it crashed again at the same point and here's what gdb says:
{{{
(gdb) x/64a tso->stackobj->sp
0x421cdbfea8: 0x7f9b3d2cc5c8

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Now that I have debugging symbols I can confirm that the segfault occurs in the {{{switch(info->i.type)}}} line (224) reported in #9130:
{{{
(gdb) bt
#0 threadPaused (cap=0xc38ef0, tso=0x421dcc50f0) at rts/ThreadPaused.c:224
#1 0x00007f9b3d2c8275 in stg_returnToSched ()
from /nix/store/f46shfdh7qmagqw11w61g099jm544fd4-ghc-8.2.0.20170704/lib/ghc-8.2.0.20170704/rts/libHSrts_thr_debug-ghc8.2.0.20170704.so
#2 0x0000000000000000 in ?? ()
}}}
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:8

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
I've got a SEGFAULT in a new location which seems related to the same
issue:
{{{
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f747043a32f in stg_BLACKHOLE_info ()
from /nix/store/ka5975xi1b7vcw98a1agqhb0y4gxcwbj-ghc-8.2.0.20170704/lib/ghc-8.2.0.20170704/rts/libHSrts_thr_debug-ghc8.2.0.20170704.so
[Current thread is 1 (LWP 25315)]
warning: File "/nix/store/xfrkm34sk0a13ha9bpki61l2k5g1v8dh-gcc-5.4.0-lib/lib/libstdc++.so.6.0.21-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
(gdb) bt
#0 0x00007f747043a32f in stg_BLACKHOLE_info ()
from /nix/store/ka5975xi1b7vcw98a1agqhb0y4gxcwbj-ghc-8.2.0.20170704/lib/ghc-8.2.0.20170704/rts/libHSrts_thr_debug-ghc8.2.0.20170704.so
#1 0x0000000000000000 in ?? ()
(gdb) info locals
No symbol table info available.
(gdb) disassemble
Dump of assembler code for function stg_BLACKHOLE_info:
0x00007f747043a240 <+0>: mov 0x8(%rbx),%rax
0x00007f747043a244 <+4>: test $0x7,%al
0x00007f747043a246 <+6>: jne 0x7f747043a32c

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
I've got a hunch that this might be related to the use of a HashMap (from unordered-containers) inside STM transactions. I'm building my program with ordered maps from containers to see if it makes a difference. It'll take a while to get results since Nix has decided that it must rebuild GHC from my local checkout.
If this is the case I think I should be able to provide a reproducible case.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:10

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
My hunch was wrong. Replacing the HashMap with a Map in the variable stored in a TMVar which threads contend for did not eliminate the segfaults. It made them much more frequent, however.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:11

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Another interesting segfault, again, I believe related to the same root
issue:
{{{
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f16a46dc18b in cas (n=<optimized out>, o=596035667054922568,
p=0x7f16a46fce88

#13970: Segmentation fault inside threadPaused
Comment (by hsyl20):
GHC uses [https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/SymbolNames z-encoding], hence "zd" is "$" and "zu" is "_". You can remove the `$w` prefix probably introduced by the worker/wrapper transformation, leaving you with `poly_go13` or maybe `poly_go` in the Haskell source.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:13
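The decoding hsyl20 describes can be sketched in a few lines of Haskell. This is a minimal, partial decoder written for illustration only: the function name `zDecode` and the lookup tables are assumptions of this sketch, not GHC's own implementation, and it handles only the common two-character escapes (no tuple or Unicode escapes).

```haskell
module Main where

-- Partial z-decoder (illustrative; not GHC's implementation).
-- Two-character escapes starting with 'z' or 'Z' are mapped back to the
-- original character; everything else is copied through unchanged.
zDecode :: String -> String
zDecode ('z' : c : rest)
  | Just d <- lookup c lowerCodes = d : zDecode rest
zDecode ('Z' : c : rest)
  | Just d <- lookup c upperCodes = d : zDecode rest
zDecode (c : rest) = c : zDecode rest
zDecode [] = []

-- Escapes introduced by a lower-case 'z'.
lowerCodes :: [(Char, Char)]
lowerCodes =
  [ ('z', 'z'), ('a', '&'), ('b', '|'), ('c', '^'), ('d', '$'), ('e', '=')
  , ('g', '>'), ('h', '#'), ('i', '.'), ('l', '<'), ('m', '-'), ('n', '!')
  , ('p', '+'), ('q', '\''), ('r', '\\'), ('s', '/'), ('t', '*'), ('u', '_')
  , ('v', '%')
  ]

-- Escapes introduced by an upper-case 'Z'.
upperCodes :: [(Char, Char)]
upperCodes =
  [ ('Z', 'Z'), ('L', '('), ('R', ')'), ('M', '['), ('N', ']'), ('C', ':') ]

main :: IO ()
main =
  -- Tail of the symbol name that appears later in this ticket.
  -- Prints: Sigym4.Propag.Engine_$wpoly_go13_info
  putStrLn (zDecode "Sigym4ziPropagziEngine_zdwpolyzugo13_info")
```

Dropping the `_info` suffix and the `$w` worker prefix from the decoded name leaves `poly_go13`, the name to look for in the Core output.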

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Replying to [comment:13 hsyl20]:
GHC uses [https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/SymbolNames z-encoding], hence "zd" is "$" and "zu" is "_". You can remove the `$w` prefix probably introduced by the worker/wrapper transformation, leaving you with `poly_go13` or maybe `poly_go` in the Haskell source.
There's no function by those names in that module; however, I managed to find "poly_go13" in its header (.hi) file. It's in a binary format which gives me no hints regarding where it might come from. Is there any way to inspect it in a more amicable way? Thanks!
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:14

#13970: Segmentation fault inside threadPaused
Comment (by mpickering):
You can use the `ghc` option `--show-iface` to inspect the `.hi` file.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:15

#13970: Segmentation fault inside threadPaused
Comment (by bgamari):
Also, the gdb command `info line sigym4zmpropagzmenginezm0zi1zi0zi0zmFERo4wFJ8F465LTBnpBC6B_Sigym4ziPropagziEngine_zdwpolyzugo13_info` may be helpful.
I generally compile my programs with `-ddump-simpl -ddump-stg -ddump-opt-cmm -ddump-to-file` when looking at problems so I can refer back to what GHC was working with.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:16

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Replying to [comment:14 albertov]:
Replying to [comment:13 hsyl20]:
GHC uses [https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/SymbolNames z-encoding], hence "zd" is "$" and "zu" is "_". You can remove the `$w` prefix probably introduced by the worker/wrapper transformation, leaving you with `poly_go13` or maybe `poly_go` in the Haskell source.
There's no function by those names in that module; however, I managed
to find "poly_go13" in its header (.hi) file. It's in a binary format
which gives me no hints regarding where it might come from. Is there any
way to inspect it in a more amicable way? Thanks!
I've found out by compiling with {{{-ddump-simpl}}} that {{{poly_go13}}}
seems to be a specialization of {{{Data.Map.lookup}}}:
{{{
$wpoly_go13 [InlPrag=[0], Occ=LoopBreaker]
  :: forall a. Int# -> Int# -> Map BlockIndex a -> Maybe a
[GblId, Arity=3, Caf=NoCafRefs, Str=]
$wpoly_go13
  = \ (@ a) (ww :: Int#) (ww1 :: Int#) (w :: Map BlockIndex a) ->
      case w of {
        Bin ipv ipv1 ipv2 ipv3 ipv4 ->
          case ipv1 of { V2 b1 b2 ->
          case b1 of { I# y# ->
          case b2 of { I# y#1 ->
          case tagToEnum# @ Bool (<# ww y#) of {
            False ->
              case tagToEnum# @ Bool (==# ww y#) of {
                False -> $wpoly_go13 @ a ww ww1 ipv4;
                True ->
                  case tagToEnum# @ Bool (<# ww1 y#1) of {
                    False ->
                      case tagToEnum# @ Bool (==# ww1 y#1) of {
                        False -> $wpoly_go13 @ a ww ww1 ipv4;
                        True -> Just @ a ipv2
                      };
                    True -> $wpoly_go13 @ a ww ww1 ipv3
                  }
              };
            True -> $wpoly_go13 @ a ww ww1 ipv3
          }
          }
          }
          };
        Tip -> Nothing @ a
      }
end Rec }
}}}
where
{{{
type BlockIndex = Linear.V2.V2 Int
}}}
I can relate to the part of my program where this comes from and,
interestingly, this was originally a {{{Data.HashMap.Strict.HashMap}}}
which I changed to a {{{Data.Map.Strict.Map}}} to see if it made a
difference (see comment:10). For some reason, segfaults are much more
frequent with the Data.Map (so I've left it like this to help debug).
This Map is stored in a TMVar which threads regularly {{{M.lookup blockIx
<$> readTMVar}}} neighboring Blocks to send them "work". When the lookup
fails they take the lock, create a new Block and a new thread to process
it, put the Block back in the Map and put the Map back in the TMVar. So,
although the BlockIndex is strict, the value isn't so perhaps this is
where shared un-evaluated thunks are created which manifests the bug? (if
my intuition about the problem is correct).
Anyway, now I have some ideas on how to attempt to reproduce this bug
outside of my program, which might be quicker than factoring the whole
engine out of the proprietary parts.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:17
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
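The access pattern described in comment 17 can be sketched roughly as follows. This is only an illustration of the shape of the code, not the actual program: `Block`, `processBlock` and `getOrCreate` are hypothetical names, a plain tuple stands in for `Linear.V2.V2 Int`, and nothing here is claimed to reproduce the crash (albertov reports later in the thread that an attempt to reproduce it from scratch did not succeed). It assumes the `stm` and `containers` packages that ship with GHC.

```haskell
module Main where

import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
  (TMVar, atomically, newTMVarIO, putTMVar, readTMVar, takeTMVar)
import Control.Monad (forM_, replicateM_)
import qualified Data.Map.Strict as M

-- Stand-in for Linear.V2.V2 Int (assumption of this sketch).
type BlockIndex = (Int, Int)

-- Hypothetical block type; the real one carries simulation state.
newtype Block = Block BlockIndex

-- Hypothetical worker; the real program processes the block here.
processBlock :: Block -> IO ()
processBlock _ = pure ()

-- Optimistic lookup without taking the lock; on a miss, take the TMVar,
-- re-check, create the block and its thread, and put the map back.
getOrCreate :: TMVar (M.Map BlockIndex Block) -> BlockIndex -> IO Block
getOrCreate var ix = do
  found <- M.lookup ix <$> atomically (readTMVar var)
  case found of
    Just b  -> pure b
    Nothing -> do
      m <- atomically (takeTMVar var)          -- take the lock
      case M.lookup ix m of
        Just b -> do                           -- another thread won the race
          atomically (putTMVar var m)
          pure b
        Nothing -> do
          let b = Block ix
          _ <- forkIO (processBlock b)         -- one new thread per block
          atomically (putTMVar var (M.insert ix b m))
          pure b

main :: IO ()
main = do
  var  <- newTMVarIO M.empty
  done <- newEmptyMVar
  let workers = 8 :: Int
  forM_ [1 .. workers] $ \w -> forkIO $ do
    forM_ [0 .. 99 :: Int] $ \i -> getOrCreate var (i `mod` 10, w `mod` 2)
    putMVar done ()
  replicateM_ workers (takeMVar done)
  m <- atomically (readTMVar var)
  print (M.size m)  -- 20 distinct block indices
```

Building with `-threaded` and running with a large `+RTS -N` value, as suggested elsewhere in the thread, is what makes the threads actually contend for the `TMVar`.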

#13970: Segmentation fault inside threadPaused
Comment (by bgamari):
Anyway, now I have some ideas on how to attempt to reproduce this bug outside of my program, which might be quicker than factoring the whole engine out of the proprietary parts.
I agree; it would be great if you could extract the essence of the issue into an independent repro.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:18

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
After removing many dependencies and some cleanup, we've open-sourced the engine of the problematic wildfire simulator. Fortunately the bug still manifests itself (which rules out some kind of nasty interaction with the foreign libraries it linked to).
To reproduce, clone https://github.com/meteogrid/propag, `cabal new-build` it and run the `propag-demo` executable. Using the rc3 pre-release and running with `+RTS -N12` it crashed 100% of the times I've tried (the refactor has increased the chance of a crash).
Sorry for this monster of a reproducible case. It's the best I could do to minimize the amount of third-party dependencies for the time being. I'm planning to remove the dependency on the wildfire-specific stuff, which should remove more dependencies (and make it useful as an arbitrary 2D spread automaton).
I initially tried to reproduce it starting from scratch but did not succeed, so I pulled out all the bigger dependencies (e.g. meteogrid/bindings-gdal) which require bindings to C++ libraries.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:19

#13970: Segmentation fault inside threadPaused
Comment (by bgamari):
Thanks albertov! I'm building your repro as we speak.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:20

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Replying to [comment:20 bgamari]:
Thanks albertov! I'm building your repro as we speak.
Thanks! Did you manage to reproduce the segfault?
By the way, I forgot to warn you that this creates a temporary work directory named `propag-work-XXXX` using `System.IO.Temp.withSystemTempDirectory`. It is meant to be cleared after running, but if the program crashes it won't be, so make sure you clean these directories up, as they can add up to a lot of space after many experiments.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:21

#13970: Segmentation fault inside threadPaused
Comment (by bgamari):
I have managed to reproduce it a few times although it does seem to take a while. Moreover, it (perhaps not surprisingly) is quite dependent upon parallelism. I was unable to reproduce it in a reasonable amount of time on my dual-core, four-thread laptop. However, on my 4-core, eight-thread server I was able to reproduce it within ten minutes or so. How many cores does your test environment have?
Also, for the sake of
My first step is to test whether the fix to #13615 is to blame; the problem indeed appears to be a race condition and instrumenting the code can easily hide the problem. For instance, I was unable to reproduce the issue even once while running the program under `rr` overnight.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:22

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Replying to [comment:22 bgamari]:
I have managed to reproduce it a few times although it does seem to take a while. Moreover, it (perhaps not surprisingly) is quite dependent upon parallelism. I was unable to reproduce it in a reasonable amount of time on my dual-core, four-thread laptop. However, on my 4-core, eight-thread server I was able to reproduce it within ten minutes or so. How many cores does your test environment have?
I'm testing on a 6-core, 12-thread machine. It is indeed hard to reproduce, although it is now much more frequent. I've noticed that giving a large `-N` value to the RTS increases the odds. I'm testing with `-N30`. Have you tried increasing this value way over your number of cores?
Another interesting thing is that without this [https://github.com/meteogrid/propag/blob/master/app/Main.hs#L69 pause] I was unable to reproduce the crash, and I noticed that a breakpoint I had set at `suspendComputation` was never being hit. Adding this pause causes `suspendComputation` to be called multiple times and eventually manifests the problem (when running outside gdb). Maybe playing with that pause helps increase the odds in your environment?
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:23

#13970: Segmentation fault inside threadPaused
Comment (by albertov):
Reducing the block size [https://github.com/meteogrid/propag/blob/master/app/Main.hs#L33 here] (to an even number) should also help increase the odds of a crash, since the program will create more (non-OS) threads.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:24

#13970: Segmentation fault inside threadPaused
Comment (by bgamari):
I have confirmed that c1c0985416a6f9766c03d361449f556905bf8e1d really is the first bad commit.
I also noticed that it is possible to reproduce the crash using `forkIO` instead of `forkOS`, which makes it a bit easier to debug. Presumably this is safe since there are no native dependencies.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:25

#13970: Segmentation fault inside threadPaused
Changes (by bgamari):
* priority: normal => highest
* status: new => patch
* milestone: => 8.2.1
Comment:
I found the issue. I neglected to consider that the stack-pointer adjustment in the `AP_STACK` entry code also accounted for the words that we would later copy from the applied stack to the current thread's stack. Since the stack-pointer adjustment happened before we attempted to blackhole the `AP_STACK` closure, there was a small chance that we would suspend the thread with uninitialized content on its stack (specifically, if another thread beat us to blackholing the closure).
This should be fixed by Phab:D3760.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:26

#13970: Segmentation fault inside threadPaused
-------------------------------------+-------------------------------------
Reporter: albertov | Owner: (none)
Type: bug | Status: patch
Priority: highest | Milestone: 8.2.1
Component: Runtime System | Version: 8.2.1-rc3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#13970: Segmentation fault inside threadPaused
-------------------------------------+-------------------------------------
Reporter: albertov | Owner: (none)
Type: bug | Status: closed
Priority: highest | Milestone: 8.2.1
Component: Runtime System | Version: 8.2.1-rc3
Resolution: fixed | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Changes (by bgamari):

* status: patch => closed
* resolution: => fixed

Comment:

Merged to `ghc-8.2` with ffea6cfe7137093c32cd2357fb9fdf9db9430543.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:28
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#13970: Segmentation fault inside threadPaused
-------------------------------------+-------------------------------------
Reporter: albertov | Owner: (none)
Type: bug | Status: closed
Priority: highest | Milestone: 8.2.1
Component: Runtime System | Version: 8.2.1-rc3
Resolution: fixed | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by albertov):

Amazing. I'm building a new ghc right now to test it. Sorry I couldn't do
it before, I didn't expect it to be fixed so quickly! Thanks bgamari!

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:29
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#13970: Segmentation fault inside threadPaused
-------------------------------------+-------------------------------------
Reporter: albertov | Owner: (none)
Type: bug | Status: closed
Priority: highest | Milestone: 8.2.1
Component: Runtime System | Version: 8.2.1-rc3
Resolution: fixed | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by mnislaih):

Hey, thanks a lot to everyone who reported, worked on, and ultimately
fixed this bug.

We (Barclays) only got around to testing RC3 yesterday and found that it
was nigh impossible to build our code base with it in our Windows
environment. Today I built GHC from the 8.2 branch with this fix included
and it's all good again.

Thanks again!

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:30
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#13970: Segmentation fault inside threadPaused
-------------------------------------+-------------------------------------
Reporter: albertov | Owner: (none)
Type: bug | Status: closed
Priority: highest | Milestone: 8.2.1
Component: Runtime System | Version: 8.2.1-rc3
Resolution: fixed | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by bgamari):

I'm glad it helped!

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13970#comment:31
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler