
Donald Bruce Stewart wrote:
ninegua:
replying to my own message... the behavior only appears when -O is used during compilation; otherwise they both run on 2 cores, but at a much lower speed (about 1/100).
Hmm, any change with -O2? Is the optimiser changing the code such that the scheduler doesn't get to switch threads as often? If you change the thread scheduler switching rate does that change anything?
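[For reference, the switching rate can be varied with the -C RTS flag, something like the following; the file name Main.hs is a placeholder, and the interval values are just examples to experiment with:]

```shell
# Compile with the threaded RTS so multiple cores can be used
ghc -O2 -threaded Main.hs -o main

# Run on 2 cores with a shorter context-switch interval (in seconds);
# -C0 makes the scheduler switch at every opportunity
./main +RTS -N2 -C0.005 -RTS
```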
See the GHC user's guide for more details:
7.12.1.3. Scheduling policy for concurrent threads
Runnable threads are scheduled in round-robin fashion. Context switches are signalled by the generation of new sparks or by the expiry of a virtual timer (the timer interval is configurable with the -C[<num>] RTS option). However, a context switch doesn't really happen until the current heap block is full. You can't get any faster context switching than this.
When a context switch occurs, pending sparks which have not already been reduced to weak head normal form are turned into new threads. However, there is a limit to the number of active threads (runnable or blocked) which are allowed at any given time. This limit can be adjusted with the -t <num> RTS option (the default is 32). Once the thread limit is reached, any remaining sparks are deferred until some of the currently active threads are completed.
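[For context, the sparks described in that passage are created with `par`; the sketch below uses `par` and `pseq` from GHC.Conc to spark one half of a computation while the current thread evaluates the other. The function name parSum is mine, purely for illustration:]

```haskell
import GHC.Conc (par, pseq)

-- Spark the evaluation of one half of the list while this thread
-- sums the other half, then combine. `par` creates a spark; `pseq`
-- orders evaluation so the sparking thread computes b first rather
-- than evaluating both halves itself.
parSum :: [Int] -> Int
parSum xs = a `par` (b `pseq` (a + b))
  where
    (ys, zs) = splitAt (length xs `div` 2) xs
    a = sum ys
    b = sum zs

main :: IO ()
main = print (parSum [1 .. 1000])  -- prints 500500
```

(Without -threaded and +RTS -N, the sparks are simply never converted to parallel work, but the result is the same.)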
I think you got that from an old version of the user's guide - it certainly isn't in the 6.6.1 or HEAD versions of the docs.

I don't have any specific advice about the program in this thread, but in my (limited) experience with debugging parallelism problems in GHC, these are common:

(a) the child threads aren't doing any work, just accumulating a large thunk which gets evaluated by the main thread sequentially.

(b) you have a sequential dependency somewhere.

(c) tight loops that don't allocate don't give the scheduler a chance to run and load-balance.

(d) GHC's scheduler is too stupid.

I doubt that (c) is a problem for you: it normally occurs when you try to use par/seq and strategies, and are playing with parallel fibonacci. Here you are using forkIO, which definitely allocates, so that shouldn't be a problem.

(d) is quite possible. I once tried to parallelise the simple concurrency example from the language shootout, which essentially consists of a long chain of threads with data items being passed along the chain. I could only get any kind of speedup when I fixed half the chain on to each CPU, rather than using the automatic migration logic in the scheduler. You can use GHC.Conc.forkOnIO for this:

  forkOnIO :: Int -> IO () -> IO ThreadId

Pass it an integer T, and the thread will be stuck to CPU T `mod` N (where N is the number of CPUs). The RTS doesn't really physically fix its execution units to CPUs, but usually the OS manages to do a reasonable job of this.

In GHC 6.8, hopefully we'll have some better tools for debugging parallelism performance problems. Michael Adams (who just finished an internship here at MSR) ported some of the GranSim visualisation tools to the current GHC; I have the patches sitting in my inbox ready to review.

Cheers,
Simon
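[A minimal sketch of pinning threads this way. Note the assumptions: the workload (a per-thread sum) is invented for illustration, the capability count is hard-coded to match a +RTS -N2 run, and the code uses forkOn from Control.Concurrent, which is the name later GHC versions give to GHC 6.6's GHC.Conc.forkOnIO:]

```haskell
import Control.Concurrent

-- Pin 4 worker threads to 2 capabilities with forkOn, collecting
-- each worker's result through a shared MVar.
-- Compile with: ghc -O2 -threaded; run with: +RTS -N2
main :: IO ()
main = do
  done <- newEmptyMVar
  let caps = 2                                   -- matches -N2
      work = let s = sum [1 .. 100000 :: Int]    -- made-up workload
             in s `seq` putMVar done s
  -- thread i is stuck to capability (i `mod` caps)
  mapM_ (\i -> forkOn (i `mod` caps) work) [0 .. 3 :: Int]
  results <- mapM (const (takeMVar done)) [0 .. 3 :: Int]
  print (sum results)  -- prints 20000200000
```

Manually distributing threads like this trades the scheduler's automatic migration for predictable placement, which is exactly what helped in the shootout example above.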