
On Saturday, 13 March 2010, 17:36:49, Michael Lesniak wrote:
Hello,
In one of my example programs I'm seeing strange behaviour: it is a very simple task pool using STM; in pseudocode it's
1. generate data structures
2. initialize data structures
3. fork threads
4. wait (using STM) until the pool is empty and all threads are finished
5. print a final message
In very few cases, which depend on the number of threads spawned, the program hangs *after* the final message of step 5 has been printed. "Few cases" means, for example, 50,000 good, terminating runs before it hangs. If you increase the number of spawned threads (to a few hundred or a few thousand), it hangs much sooner. Since forked threads are terminated when the main thread exits (which it should do right after printing the message), this behaviour is quite unexpected.
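For concreteness, here is a minimal, self-contained sketch of that structure. It is not the attached programme; all names (pool, working, worker, numTasks, numThreads) are made up for illustration, and the threadDelay stands in for real work:

    import Control.Concurrent
    import Control.Concurrent.STM
    import Control.Monad
    import qualified Data.Set as Set

    main :: IO ()
    main = do
        let numTasks   = 1000 :: Int
            numThreads = 8 :: Int
        pool    <- newTChanIO                 -- step 1/2: the task pool
        working <- newTVarIO Set.empty        -- set of currently busy workers
        atomically $ mapM_ (writeTChan pool) [1 .. numTasks]
        replicateM_ numThreads $ forkIO (worker pool working)   -- step 3
        -- step 4: wait until no tasks are left and no worker is busy
        atomically $ do
            empty <- isEmptyTChan pool
            busy  <- readTVar working
            unless (empty && Set.null busy) retry
        putStrLn "pool empty, all workers idle"                 -- step 5
        -- idle workers still block on the empty channel here; they are
        -- reaped when the main thread exits.

    worker :: TChan Int -> TVar (Set.Set ThreadId) -> IO ()
    worker pool working = do
        tid <- myThreadId
        forever $ do
            task <- atomically $ do
                t <- readTChan pool           -- retries while the pool is empty
                w <- readTVar working
                writeTVar working $! Set.insert tid w
                return t
            threadDelay (100 + task `mod` 100)   -- stand-in for real work
            atomically $ do
                w <- readTVar working
                writeTVar working $! Set.delete tid w

In this shape a worker marks itself busy in the same transaction in which it takes a task, so the main thread's "pool empty and nobody busy" check cannot fire while a task is still in flight.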
I won't pretend I really understand what's going on, but it seems that occasionally a couple of threads are caught in a retry loop. Having each thread print out its ThreadId after it's done, I found that when it hangs, only one thread reports that it's done. I don't see how that could happen, but that's what I found.

For the attached programme, in the task-getting code, the branch

        else if Set.null work
                then return Nothing
                else retry

doesn't really make sense: when the channel is empty, we could return Nothing right away. I suppose that, in the real programme, some threads might write further tasks to the channel, so while not all threads have finished, the channel might not be permanently empty? If not, "return Nothing" whenever the channel is empty ought to reliably end all threads and prevent hanging. If yes, writing strict values to working:

    get chan working = do
        tid <- myThreadId
        -- atomically commit that this thread is not working anymore (since we
        -- try to get a task, we must be quasi-idle!)
        atomically $ do
            work <- Set.delete tid `fmap` readTVar working
            writeTVar working $! work
        -- wait for a new task. if all threads are idle and the pool is empty,
        -- return.
        atomically $ do
            empty <- isEmptyTChan chan
            work  <- readTVar working
            if (not empty)
                then do
                    task <- readTChan chan
                    writeTVar working $! (Set.insert tid work)
                    return (Just task)
                else if Set.null work
                        then return Nothing
                        else retry

seems to prevent hanging on my box (running fine with "100 64 1 +RTS -N", nearing task 60000; without the strict writes it typically hangs after a few dozen or hundred runs). I think the strict write in "writeTVar working $! (Set.insert tid work)" isn't necessary, but I haven't yet tested that. Why writing a thunk in

    atomically $ do
        work <- Set.delete tid `fmap` readTVar working
        writeTVar working work

should cause it to hang sometimes, I've no idea. Nor whether the strict write really fixes it or it's just a fluke.
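A sketch of that simpler variant, under the assumption that workers never feed new tasks back into the channel (the name get and the shape of the code follow the quoted version; the signature is only my guess at the types involved):

    import Control.Concurrent (ThreadId, myThreadId)
    import Control.Concurrent.STM
    import qualified Data.Set as Set

    get :: TChan task -> TVar (Set.Set ThreadId) -> IO (Maybe task)
    get chan working = do
        tid <- myThreadId
        -- mark this thread as idle before looking for a task
        atomically $ do
            w <- readTVar working
            writeTVar working $! Set.delete tid w
        -- no retry: an empty channel immediately means "no more work"
        atomically $ do
            empty <- isEmptyTChan chan
            if empty
                then return Nothing
                else do
                    task <- readTChan chan
                    w <- readTVar working
                    writeTVar working $! Set.insert tid w
                    return (Just task)

Since no transaction ever retries here, every worker is guaranteed to receive Nothing once the channel is drained, and nothing can be left waiting.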
Since I've experienced strange behaviour in the past that turned out to be the fault of my system configuration[1], I am a bit cautious about reporting a bug on GHC's bug tracker, especially since reproducing it is so difficult and random.
So my question is how much circumspection is expected/needed before one should enter a bug in the bug tracker? I've tested the attached code on three different systems (different Linux distributions, but always GHC 6.12.1, since installing the older versions is a bit costly) and observed the mentioned behaviour. Is this enough to justify a bug report? Or, on the other hand, could someone spot the
I'd ask such things on glasgow-haskell-users: there's less traffic, it's a GHC-specific list, and it's more likely that one of the GHC experts will notice it there and can tell you whether it's a bug, a feature, or an error in your code.
error in the attached code? Given my history with strange parallel behaviour, I am much more inclined to suspect my own code, but I can't spot the error, and the described behaviour (hanging *after* the final message) is really strange.
Cheers, Michael
[1] http://www.haskell.org/pipermail/haskell-cafe/2010-March/073938.html