
Hi,

While experimenting with concurrency, I got a somewhat surprising result.

I have a program ('flower') that reads an input file and generates one or more output files. As the different output files are independent, I construct one IO action for each output requested on the command line, and just forkIO one thread for each.

This seems to work fairly well on one CPU, so I decided to try it on multiple CPUs, using +RTS -N. To my surprise, this made the program take several times longer (wall clock). So I tried -N2 and -N4 to see how that turned out. Results are, in minutes:

  format     def     -N2     -N4      -N
  i            0       0       0       0
  q            2       2       2      13
  f            2       5       1      14
  h            8      10       2      27
  s           26      17      11       -
  T           37      23      16       -
  F           47      38      42       -

  CPU      2543u   3027u  (lost)       -
             36s    593s
             92%    161%

I had to terminate the -N run after five hours of wall time, at 158026.68s user, 60682.76s system, 1210% CPU.

So, well, it seems this scales okay as long as there are enough threads, but scales *horribly* when you run more threads than processes. Is this a correct assessment? Would it make sense to simply cap -N to the number of forked threads?

I guess I should try this with GHC7, but is there reason to believe it will perform better?

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
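
P.S. For concreteness, the threading structure described above looks roughly like the following. This is only a minimal sketch, not the actual flower source: the name runAll and the placeholder actions in main are hypothetical, and it assumes the main thread waits on one MVar per forked thread before exiting.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (finally)

-- Run every action in its own Haskell thread and block until all of
-- them have finished. (runAll is a hypothetical name for illustration.)
runAll :: [IO ()] -> IO ()
runAll actions = do
  dones <- mapM fork actions
  mapM_ takeMVar dones
  where
    fork :: IO () -> IO (MVar ())
    fork act = do
      done <- newEmptyMVar
      -- Signal completion even if the action throws an exception.
      _ <- forkIO (act `finally` putMVar done ())
      return done

main :: IO ()
main =
  -- Placeholder actions standing in for the real output writers,
  -- one per requested format.
  runAll [ putStrLn ("writing output " ++ [c]) | c <- "iqfhsTF" ]
```

The binary has to be built with ghc -threaded for +RTS -N to have any effect. With seven outputs requested, capping would mean passing an explicit -N7 (or lower) instead of a bare -N.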