
Hi Simon,
It seems that setnumcapabilities001 still occasionally fails, although this time in a different mode: https://phabricator.haskell.org/harbormaster/build/14485/?l=100
Cheers,
- Ben

How many cores does the builder machine have? (this should make it easier
for me to repro)
On 25 October 2016 at 16:56, Ben Gamari wrote:
> Hi Simon,
> It seems that setnumcapabilities001 still occasionally fails, although this time in a different mode: https://phabricator.haskell.org/harbormaster/build/14485/?l=100
> Cheers,
> - Ben

Briefly looking at the code, it seems that several of the global variables
involved should be volatile: n_capabilities, enabled_capabilities, and
capabilities. Perhaps in a loop like the one in scheduleDoGC the compiler
moves the reads of n_capabilities or capabilities outside the loop. After a
failed requestSync, that loop would then not see updated values for those
globals. That particular loop isn't being optimized that way for me, but I
think it could happen without volatile.
Ryan
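
A minimal sketch of the concern described above, using hypothetical stand-in variables rather than the real RTS declarations. With a plain global and nothing opaque in the loop body, a C compiler is allowed to load the value once before the loop; with a volatile-qualified global, every iteration must perform a load. Run single-threaded, both functions return the same result, so any difference would only show up in the generated code.

```c
#include <stdint.h>

/* Hypothetical stand-ins for an RTS global such as n_capabilities;
 * these are NOT the real declarations. */
uint32_t plain_count_sketch = 2;
volatile uint32_t volatile_count_sketch = 2;

/* Nothing in this loop body can be seen to modify plain_count_sketch,
 * so the compiler may read it once and reuse that value. */
uint32_t sum_plain_sketch(uint32_t iters)
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < iters; i++) {
        total += plain_count_sketch;
    }
    return total;
}

/* Accesses to a volatile object may not be elided or merged, so each
 * iteration performs its own load. */
uint32_t sum_volatile_sketch(uint32_t iters)
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < iters; i++) {
        total += volatile_count_sketch;
    }
    return total;
}
```

Comparing the output of the compiler at -O2 for the two functions is one way to check whether the hoisting actually happens for a given loop.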
On Thu, Oct 27, 2016 at 9:18 AM, Ben Gamari wrote:
> Simon Marlow writes:
> > I haven't been able to reproduce the failure yet. :(
> Indeed I've also not seen it in my own local builds. It's quite a fragile failure.
> Cheers,
> - Ben
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Hi Ryan, I don't think that's the issue. Those variables can only be
modified in setNumCapabilities, which acquires *all* the capabilities
before it makes any changes. There should be no other threads running RTS
code(*) while we change the number of capabilities. In particular we
shouldn't be in releaseGCThreads while enabled_capabilities is being
changed.
(*) well except for the parts at the boundary with the external world which
run without a capability, such as rts_lock() which acquires a capability.
Cheers
Simon
On 27 Oct 2016 17:10, "Ryan Yates" wrote:
> Briefly looking at the code, it seems that several of the global variables involved should be volatile: n_capabilities, enabled_capabilities, and capabilities. Perhaps in a loop like the one in scheduleDoGC the compiler moves the reads of n_capabilities or capabilities outside the loop. After a failed requestSync, that loop would then not see updated values for those globals. That particular loop isn't being optimized that way for me, but I think it could happen without volatile.
> Ryan

Right, it is compiler effects at this boundary that I'm worried about:
values that are not re-read from memory after the changes have been made,
rather than memory effects or data races.
On Fri, Oct 28, 2016 at 3:02 AM, Simon Marlow wrote:
> Hi Ryan, I don't think that's the issue. Those variables can only be modified in setNumCapabilities, which acquires *all* the capabilities before it makes any changes. There should be no other threads running RTS code(*) while we change the number of capabilities. In particular we shouldn't be in releaseGCThreads while enabled_capabilities is being changed.
> (*) well except for the parts at the boundary with the external world which run without a capability, such as rts_lock() which acquires a capability.
> Cheers
> Simon

I see, but the compiler has no business caching things across
requestSync(), which can in principle change anything: even if the compiler
could see all the code, it would find a pthread_cond_wait() in there.
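
A minimal sketch of that argument, with hypothetical names standing in for the RTS ones. A call the compiler cannot see through may write any global, so a value loaded before the call has to be loaded again afterwards. The function pointer below plays the role of requestSync() and is an assumption made for illustration only.

```c
#include <stdint.h>

/* Hypothetical stand-in for an RTS global; NOT the real declaration. */
uint32_t n_caps_sketch = 1;

/* Hypothetical stand-in for requestSync(): a call whose body is not
 * known at the call site. */
typedef void (*sync_fn_sketch)(void);

/* One possible callee: it changes the global, much as another thread
 * could while the caller is blocked inside the call. */
void grow_caps_sketch(void)
{
    n_caps_sketch = 4;
}

/* Returns how much the global changed across the opaque call. */
uint32_t delta_across_sync_sketch(sync_fn_sketch sync)
{
    n_caps_sketch = 1;                 /* reset so the sketch is repeatable */
    uint32_t before = n_caps_sketch;   /* first load */
    sync();                            /* may write any global */
    uint32_t after = n_caps_sketch;    /* must be loaded again here */
    return after - before;
}
```

Calling delta_across_sync_sketch(grow_caps_sketch) yields a non-zero difference, because the second read observes the write made inside the call.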
Anyway I've found the problem - it was caused by a subsequent GC
overwriting the values of gc_threads[].idle before the previous GC had
finished releaseGCThreads() which reads those values. Diff on the way...
Cheers
Simon
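
The interleaving described above can be replayed deterministically in a single thread. The sketch below is not the actual diff: the struct, the array size, and every function name are hypothetical, and reading from a snapshot taken at the start of the collection is only one possible remedy, shown here for contrast.

```c
#include <stdbool.h>
#include <stddef.h>

#define N_GC_THREADS_SKETCH 4

/* Hypothetical, simplified stand-in for the per-GC-thread state. */
struct gc_thread_sketch { bool idle; };
static struct gc_thread_sketch gc_threads_sketch[N_GC_THREADS_SKETCH];

/* Start a collection: record which threads sit this GC out. */
static void begin_gc_sketch(const bool idle[])
{
    for (size_t i = 0; i < N_GC_THREADS_SKETCH; i++) {
        gc_threads_sketch[i].idle = idle[i];
    }
}

/* Count the threads a release step would wake. With a NULL snapshot it
 * reads the shared flags; otherwise it reads the caller's own copy. */
static size_t release_sketch(const bool *snapshot)
{
    size_t released = 0;
    for (size_t i = 0; i < N_GC_THREADS_SKETCH; i++) {
        bool idle = snapshot ? snapshot[i] : gc_threads_sketch[i].idle;
        if (!idle) {
            released++;
        }
    }
    return released;
}

/* Replay: the second GC overwrites the flags before the first GC has
 * run its release step. Returns the number of threads the first GC
 * would release. */
size_t simulate_overlap_sketch(bool use_snapshot)
{
    const bool first[N_GC_THREADS_SKETCH]  = { false, false, true, true };
    const bool second[N_GC_THREADS_SKETCH] = { true, true, true, false };

    begin_gc_sketch(first);
    begin_gc_sketch(second);   /* overwrite happens too early */
    return release_sketch(use_snapshot ? first : NULL);
}
```

In this replay the first GC had two active threads, but reading the shared flags after the overwrite releases only one; reading its own snapshot releases both.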
On 28 October 2016 at 11:58, Ryan Yates wrote:
> Right, it is compiler effects at this boundary that I'm worried about: values that are not re-read from memory after the changes have been made, rather than memory effects or data races.
participants (3)
- Ben Gamari
- Ryan Yates
- Simon Marlow