debugging a hanging program: where to start?

Hi Cafe, I have a program that I can reliably cause to hang. It's concurrent using STM, so I think it could be a deadlock or related issue. I also do some IO, so I think it could be blocking in a system call. It only hangs when compiled with -threaded. I tried building with -prof, and running with -hc -xt to get a clue where in which function it is stopping, but the resulting profile didn't help much. I've started sprinkling print statements, but I thought I'd trawl for better tips and techniques. Any suggestions on approaches? Thanks, Aran

On Wed, May 12, 2010 at 8:15 PM, Aran Donohue
Hi Cafe,
I have a program that I can reliably cause to hang. It's concurrent using STM, so I think it could be a deadlock or related issue. I also do some IO, so I think it could be blocking in a system call. It only hangs when compiled with -threaded. I tried building with -prof, and running with -hc -xt to get a clue where in which function it is stopping, but the resulting profile didn't help much.
Usually it's the other way around (hangs without threaded). So, I would guess that you're experiencing some sort of deadlock issue.
I've started sprinkling print statements, but I thought I'd trawl for better tips and techniques. Any suggestions on approaches?
Try threadscope: http://research.microsoft.com/en-us/projects/threadscope/ I haven't used it myself but I've heard it's *the* tool for investigating parallel program behavior for haskell programs. Hope that helps, Jason

Aran Donohue
I have a program that I can reliably cause to hang. It's concurrent using STM, so I think it could be a deadlock or related issue. I also do some IO, so I think it could be blocking in a system call.
If it's the latter, 'strace' might help you. Use 'strace -p PID' to attach to a running process. Similarly, 'ltrace' can trace library calls (but probably less useful in this context?) (This is on Linux, but other OSes are likely to have similar tools.) -k -- If I haven't seen further, it is by standing in the footprints of giants

Thanks folks! Forward progress is made...
Unfortunately, programs don't seem to write out their threadscope event logs
until they terminate, and mine hangs until I kill it, so I can't get at the
event log.
Tracing has taught me that before the hang-cause, my program splits its time
in pthread_cond_wait in two different threads, and select in a third. After
the hang, it no longer calls select and one of those pthread_cond_waits in
the other. In the version without -threaded that doesn't hang, it never does
any pthread_cond_wait and never misses the select.
Now to go figure out what impossible condition it's waiting on, I guess.
Aran
On Thu, May 13, 2010 at 2:13 AM, Ketil Malde
Aran Donohue
writes: I have a program that I can reliably cause to hang. It's concurrent using STM, so I think it could be a deadlock or related issue. I also do some IO, so I think it could be blocking in a system call.
If it's the latter, 'strace' might help you. Use 'strace -p PID' to attach to a running process. Similarly, 'ltrace' can trace library calls (but probably less useful in this context?)
(This is on Linux, but other OSes are likely to have similar tools.)
-k -- If I haven't seen further, it is by standing in the footprints of giants

On Thu, May 13, 2010 at 5:53 AM, Aran Donohue
Thanks folks! Forward progress is made...
Unfortunately, programs don't seem to write out their threadscope event logs until they terminate, and mine hangs until I kill it, so I can't get at the event log.
Tracing has taught me that before the hang-cause, my program splits its time in pthread_cond_wait in two different threads, and select in a third. After the hang, it no longer calls select and one of those pthread_cond_waits in the other. In the version without -threaded that doesn't hang, it never does any pthread_cond_wait and never misses the select.
Now to go figure out what impossible condition it's waiting on, I guess.
The select sounds like the IO manager thread (a thread in the RTS not your code). Is it possible that one of your threads does work but never allocates memory? I've heard in some cases that can lead to starvation. I think the explanation was that thread switching happens on allocation? Jason

I have an accept-loop:
do (conn, _saddr) <- accept sock
forkIO $ initializeConnection conn
Which allocates memory iff accept allocates, I suppose. To test the theory,
is there a way I can force an allocation that won't get optimized away?
According to the old print-statement debugging method, it is accept that
causes the problem. The accept is the last thing that happens before the
hang.
I created a test program to hammer on accept.
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Monad (forever, liftM)
import Network (listenOn, withSocketsDo, PortID(..))
import Network.BSD (getProtocolNumber)
import Network.Socket (Family(..), SockAddr(..), SocketType(..), Socket(..),
accept, connect, inet_addr, socket, sClose)
import Network.Socket.ByteString
import Data.ByteString.Char8 ()
import System.Environment (getArgs)
import System.IO (stdout, hFlush)
port = 8080
main = do
args <- getArgs
if length args < 1 || (head args /= "client" && head args /= "server")
then putStrLn "say client or server"
else withSocketsDo $ case head args of
"client" -> client
"server" -> server
client = do
putStrLn "Client mode"
tcp <- getProtocolNumber "tcp"
forever $ do sock <- socket AF_INET Stream tcp
localhost <- inet_addr "127.0.0.1"
putStr "Making a connection..."
connect sock (SockAddrInet port localhost)
send sock "a"
_ <- recv sock 1
sClose sock
putStrLn " done client connection"
server = do
putStrLn "Server mode"
sock <- listenOn (PortNumber port)
count <- newMVar 0
forever $ do count' <- modifyMVar count (\c -> return (c+1,c+1))
putStr $ (show count') ++ " About to accept..."
(conn, _saddr) <- accept sock
putStrLn " accepted."
hFlush stdout
forkIO $ handleServerConnection conn count'
handleServerConnection conn count = do
putStr $ (show count) ++ " Handling a server connection..."
rd <- recv conn 1
send conn "a"
sClose conn
putStrLn " done server connection."
hFlush stdout
While it doesn't seem to fully hang, it does go through mysterious "hiccups"
in which an accept call takes many seconds to return. I'm on Mac OS X Snow
Leopard, ghc 6.12.1.
I'm not quite sure how I ought to proceed. I'm still very open to debugging
tools and techniques I could use to approach the problem!
Aran
On Thu, May 13, 2010 at 12:10 PM, Jason Dagit
On Thu, May 13, 2010 at 5:53 AM, Aran Donohue
wrote: Thanks folks! Forward progress is made...
Unfortunately, programs don't seem to write out their threadscope event logs until they terminate, and mine hangs until I kill it, so I can't get at the event log.
Tracing has taught me that before the hang-cause, my program splits its time in pthread_cond_wait in two different threads, and select in a third. After the hang, it no longer calls select and one of those pthread_cond_waits in the other. In the version without -threaded that doesn't hang, it never does any pthread_cond_wait and never misses the select.
Now to go figure out what impossible condition it's waiting on, I guess.
The select sounds like the IO manager thread (a thread in the RTS not your code). Is it possible that one of your threads does work but never allocates memory? I've heard in some cases that can lead to starvation. I think the explanation was that thread switching happens on allocation?
Jason

On Thursday 13 May 2010 21:28:21, Aran Donohue wrote:
I have an accept-loop:
do (conn, _saddr) <- accept sock forkIO $ initializeConnection conn
Which allocates memory iff accept allocates, I suppose. To test the theory, is there a way I can force an allocation that won't get optimized away?
t <- getCPUTime let n = t `rem` 12345 s = sum [1 .. n] s `seq` doSomething could work. Another option is to insert a threadDelay 0 in your forever loops to enable context-switching if some other thread is runnable.
I'm not quite sure how I ought to proceed. I'm still very open to debugging tools and techniques I could use to approach the problem!
Aran

The memory allocation / threadDelay 0 has no effect, so it isn't *that*
bug.
I've noticed something new, too. I have some other sleeping threads in the
system. When those sleep for short threadDelays compared to the frequency of
accepts, the problem is accelerated. However when the other threads have
threadDelays that go longer than the accept frequency, the problem doesn't
seem to occur.
That is, it seems that having pending thread-awakenings prevents the issue.
I'm stumped!
On Thu, May 13, 2010 at 3:51 PM, Daniel Fischer
On Thursday 13 May 2010 21:28:21, Aran Donohue wrote:
I have an accept-loop:
do (conn, _saddr) <- accept sock forkIO $ initializeConnection conn
Which allocates memory iff accept allocates, I suppose. To test the theory, is there a way I can force an allocation that won't get optimized away?
t <- getCPUTime let n = t `rem` 12345 s = sum [1 .. n] s `seq` doSomething
could work.
Another option is to insert a threadDelay 0 in your forever loops to enable context-switching if some other thread is runnable.
I'm not quite sure how I ought to proceed. I'm still very open to debugging tools and techniques I could use to approach the problem!
Aran
participants (4)
-
Aran Donohue
-
Daniel Fischer
-
Jason Dagit
-
Ketil Malde