Help debugging a deadlock in salvia on GHC 6.10 i386

I'm working on trimming down the test code and filing a real bug. In the meantime I'll list out what I know right now; if anything jumps out, please let me know. Thanks!

I'm running a webserver built using salvia [1] and GHC 6.10 [2]. I've trimmed down the code enough that there is no obvious source of a deadlock in either salvia or the rest of the web server. I don't have any specific conditions that reproduce the issue either: after some time, anywhere from a few minutes to a few hours, the server deadlocks. No particular request or number of requests seems to trigger the deadlock.

1) Salvia accepts connections on the main thread, then forkIOs a new thread to actually handle the request. The new thread uses Handle-based IO. (A sketch of this pattern follows at the end of this message.)

2) As I understand it, there are issues with forkProcess and Handle-based IO. While this is a web server, I'm avoiding "daemonize" code that relies on forkProcess, so no forkProcess is occurring that I know of.

3) The thread state summary printed by calling printAllThreads() from GDB is:

all threads:
threads on capability 0:
other threads:
  thread 2 @ 0xb7d66000 is blocked on an MVar @ 0xb7d670b4
  thread 3 @ 0xb7d74214 is blocked on an MVar @ 0xb7da88f0

4) The thread states according to "thread apply all bt" from GDB are:

Thread 4 (Thread 0xb7cffb90 (LWP 30891)):
#0  0xb8080416 in __kernel_vsyscall ()
#1  0xb7fd0075 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/i686/cmov/libpthread.so.0
#2  0x083f4320 in waitCondition (pCond=0x9a7cc1c, pMut=0x9a7cc4c) at posix/OSThreads.c:65
#3  0x0840de64 in yieldCapability (pCap=0xb7cff36c, task=0x9a7cc00) at Capability.c:506
#4  0x083eb292 in schedule (initialCapability=0x8565aa0, task=0x9a7cc00) at Schedule.c:293
#5  0x083ed5ff in workerStart (task=0x9a7cc00) at Schedule.c:1923
#6  0xb7fcc50f in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#7  0xb7f49a0e in clone () from /lib/tls/i686/cmov/libc.so.6

Thread 3 (Thread 0xb74feb90 (LWP 30892)):
#0  0xb8080416 in __kernel_vsyscall ()
#1  0xb7fd0075 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/i686/cmov/libpthread.so.0
#2  0x083f4320 in waitCondition (pCond=0x9a7ef3c, pMut=0x9a7ef6c) at posix/OSThreads.c:65
#3  0x0840de64 in yieldCapability (pCap=0xb74fe36c, task=0x9a7ef20) at Capability.c:506
#4  0x083eb292 in schedule (initialCapability=0x8565aa0, task=0x9a7ef20) at Schedule.c:293
#5  0x083ed5ff in workerStart (task=0x9a7ef20) at Schedule.c:1923
#6  0xb7fcc50f in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#7  0xb7f49a0e in clone () from /lib/tls/i686/cmov/libc.so.6

Thread 2 (Thread 0xb6cfdb90 (LWP 30916)):
#0  0xb8080416 in __kernel_vsyscall ()
#1  0xb7fd0075 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/i686/cmov/libpthread.so.0
#2  0x083f4320 in waitCondition (pCond=0x9a7e12c, pMut=0x9a7e15c) at posix/OSThreads.c:65
#3  0x0840de64 in yieldCapability (pCap=0xb6cfd36c, task=0x9a7e110) at Capability.c:506
#4  0x083eb292 in schedule (initialCapability=0x8565aa0, task=0x9a7e110) at Schedule.c:293
#5  0x083ed5ff in workerStart (task=0x9a7e110) at Schedule.c:1923
#6  0xb7fcc50f in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#7  0xb7f49a0e in clone () from /lib/tls/i686/cmov/libc.so.6

Thread 1 (Thread 0xb7e666b0 (LWP 30890)):
#0  0xb8080416 in __kernel_vsyscall ()
#1  0xb7fd0075 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/i686/cmov/libpthread.so.0
#2  0x083f4320 in waitCondition (pCond=0x9a7cb3c, pMut=0x9a7cb6c) at posix/OSThreads.c:65
#3  0x0840de64 in yieldCapability (pCap=0xbfa822ac, task=0x9a7cb20) at Capability.c:506
#4  0x083eb292 in schedule (initialCapability=0x8565aa0, task=0x9a7cb20) at Schedule.c:293
#5  0x083ed463 in scheduleWaitThread (tso=0xb7d80800, ret=0x0, cap=0x8565aa0) at Schedule.c:1895
#6  0x083e851a in rts_evalLazyIO (cap=0x8565aa0, p=0x8489478, ret=0x0) at RtsAPI.c:517
#7  0x083e79d5 in real_main () at Main.c:111

Anybody think of anything so far?

Cheers,
Corey O'Connor
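To make point 1 above concrete, here is a minimal sketch of the accept-then-forkIO structure, written against the old Network/System.IO API of the GHC 6.10 era. It illustrates the pattern only and is not salvia's actual code; the port number and the trivial handler are made up.

import Control.Concurrent (forkIO)
import Control.Exception (finally)
import Network (PortID (PortNumber), accept, listenOn, withSocketsDo)
import System.IO (Handle, hClose, hGetLine, hPutStr)

-- Accept on the main thread; forkIO a Handle-based worker per connection.
main :: IO ()
main = withSocketsDo $ do
  sock <- listenOn (PortNumber 8080)
  acceptLoop sock
 where
  acceptLoop sock = do
    (h, _host, _port) <- accept sock
    -- The worker owns the Handle; close it even if the handler throws.
    _ <- forkIO (handleRequest h `finally` hClose h)
    acceptLoop sock

-- Stand-in for the real request handler: all I/O goes through the Handle.
handleRequest :: Handle -> IO ()
handleRequest h = do
  _requestLine <- hGetLine h
  hPutStr h "HTTP/1.0 200 OK\r\n\r\nok"

The relevant property is that each request's blocking Handle operations run in a forkIO'd thread, which is where any asynchronous exception aimed at that request would be delivered.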

On Sat, Jun 6, 2009 at 2:09 PM, Corey O'Connor wrote:
> I'm running a webserver built using salvia [1] and GHC 6.10 [2]. I've trimmed down the code enough that there is no obvious source of a deadlock in either salvia or the rest of the web server. I don't have any specific conditions that reproduce the issue either: after some time, anywhere from a few minutes to a few hours, the server deadlocks. No particular request or number of requests seems to trigger the deadlock.
I've narrowed the issue down to the use of System.Timeout.timeout. Without the timeout combinator the server never hits the deadlock; with it, the server eventually deadlocks.

I'll look into whether any of salvia's threads would be adversely affected by a timeout. Are there any known issues with the timeout implementation besides the (reasonable) inability to time out FFI calls?

Cheers,
Corey O'Connor
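For concreteness, the wrapping looks roughly like the sketch below. The names handleWithTimeout and requestTimeout, and the 30-second value, are made up for illustration; the actual code in the server differs.

import System.IO (Handle, hClose)
import System.Timeout (timeout)

-- Hypothetical request budget: 30 seconds, in microseconds
-- (the unit timeout expects).
requestTimeout :: Int
requestTimeout = 30 * 1000000

-- Run the per-request handler under a timeout; Nothing means the
-- handler was interrupted before it finished.
handleWithTimeout :: (Handle -> IO ()) -> Handle -> IO ()
handleWithTimeout handler h = do
  result <- timeout requestTimeout (handler h)
  case result of
    Nothing -> putStrLn "request handler timed out"
    Just () -> return ()
  hClose h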

On 13/06/2009 17:21, Corey O'Connor wrote:
> On Sat, Jun 6, 2009 at 2:09 PM, Corey O'Connor wrote:
>> I'm running a webserver built using salvia [1] and GHC 6.10 [2]. I've trimmed down the code enough that there is no obvious source of a deadlock in either salvia or the rest of the web server. I don't have any specific conditions that reproduce the issue either: after some time, anywhere from a few minutes to a few hours, the server deadlocks. No particular request or number of requests seems to trigger the deadlock.
>
> I've narrowed the issue down to the use of System.Timeout.timeout. Without the timeout combinator the server never hits the deadlock; with it, the server eventually deadlocks.
>
> I'll look into whether any of salvia's threads would be adversely affected by a timeout. Are there any known issues with the timeout implementation besides the (reasonable) inability to time out FFI calls?
I don't know of any current issues, but historically there have been several bugs in this area. System.Timeout uses throwTo, which is devilishly difficult to get right.

I assume you're using GHC 6.10.3? If you could compile your server with -debug and capture the output when running it with +RTS -Ds, that might help us diagnose the problem.

Cheers,
Simon
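For context on why throwTo is tricky here, the core of a throwTo-based timeout looks roughly like the sketch below. This is a simplified illustration, not the actual System.Timeout source; the Timeout exception type is local to the example.

{-# LANGUAGE DeriveDataTypeable #-}
import Control.Concurrent (forkIO, killThread, myThreadId, threadDelay)
import Control.Exception (Exception, bracket, handleJust, throwTo)
import Data.Typeable (Typeable)

data Timeout = Timeout deriving (Show, Typeable)
instance Exception Timeout

-- A watchdog thread sleeps, then throws an asynchronous exception at
-- the worker. The worker catches only our Timeout; bracket guarantees
-- the watchdog is killed whether the action finishes or is interrupted.
timeoutSketch :: Int -> IO a -> IO (Maybe a)
timeoutSketch usecs action = do
  worker <- myThreadId
  handleJust (\Timeout -> Just ())   -- catch only our Timeout
             (\_ -> return Nothing)  -- the timer fired first
    (bracket (forkIO (threadDelay usecs >> throwTo worker Timeout))
             killThread              -- always stop the watchdog
             (\_ -> fmap Just action))

Even a sketch this short has races to reason about, e.g. the timer can fire just as the action completes, and the implementation must ensure the result is not lost; that is exactly the kind of window where the historical throwTo bugs lived.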