FW: [Haskell-cafe] The RTSOPTS "-qm" flag's impact on runtime

Simon: did you see this? A factor of 50 in runtime seems pretty significant!

Simon

-----Original Message-----
From: Haskell-Cafe [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Iustin Pop
Sent: 30 September 2013 23:14
To: Haskell Cafe
Subject: [Haskell-cafe] The RTSOPTS "-qm" flag's impact on runtime

Hi all,

I found an interesting case where the rtsopts -qm flag makes a significant difference in runtime (~50x).

This is using GHC 7.6.3, llvm 3.4, program compiled with "-threaded -O2 -fllvm" and a couple of language extensions. Source is at http://benchmarksgame.alioth.debian.org/u64q/benchmark.php?test=chameneosredux&lang=ghc&id=4&data=u64q, on the language shootout benchmarks.

Running the code without -N results (on my computer) in around 4 seconds of runtime:

$ time ./orig 6000000
…
real 0m3.919s
user 0m3.903s
sys 0m0.010s

This is reasonably consistent. Running -N4 (this is an 8-core machine) results in the surprising:

$ time ./orig 6000000 +RTS -N4
…
real 1m15.154s
user 1m38.790s
sys 2m7.947s

The cores are all used very erratically (continuously changing 5%-20%-40%) and the overall cpu usage is ~27-28%. Note the surprising 2m7s of sys usage, which means the kernel is involved a lot…

Note that removing the explicit forkOn and running with -N4 results in somewhat worse performance:

real 2m6.548s
user 2m13.470s
sys 2m3.043s

So in that sense the forkOn itself is not at fault. What I have found is that -qm is here a life saver:

$ time ./orig 6000000 +RTS -N4 -qm
real 0m2.773s
user 0m5.610s
sys 0m0.123s

Adding -qa doesn't make a big difference. To summarise more runs (in terms of cpu used, user+sys):

with forkOn:
- -N4: 228s
- -N4 -qa: 110s
- -N4 -qm: 6s
- -N4 -qm -qa: 6s

without forkOn:
- -N4: 253s
- -N4 -qa: 252s
- -N4 -qm: 5s
- -N4 -qm -qa: 5s

(Note that "without forkOn" is a bit slower in terms of wall-clock, as the "with forkOn" version distributes the work a bit better, even if it uses overall a tiny bit more CPU.)
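
For readers without the benchmark source at hand, here is a minimal sketch of the "with forkOn" versus "without forkOn" distinction. It is not the shootout program: the names `pingPong` and `spawn`, the `unpinned` argument, and the round count are all made up for illustration. Two threads hand a counter back and forth through MVars, created either with forkOn (pinned to one capability) or with forkIO (which the runtime is allowed to migrate between capabilities). Whether this small program shows the same slowdown as the benchmark is untested here.

```haskell
import Control.Concurrent (forkIO, forkOn, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM_)
import System.Environment (getArgs)

-- Hypothetical helper: bounce a counter between two threads `rounds` times
-- and return its final value. With pinned = True the threads are created
-- with forkOn; otherwise with forkIO.
pingPong :: Bool -> Int -> IO Int
pingPong pinned rounds = do
  caps <- getNumCapabilities
  ping <- newEmptyMVar
  pong <- newEmptyMVar
  done <- newEmptyMVar
  let spawn i act
        | pinned    = forkOn (i `mod` caps) act  -- stays on capability i
        | otherwise = forkIO act                 -- may be migrated by the RTS
  -- Echo thread: increment whatever arrives on `ping`, send it on `pong`.
  _ <- spawn 0 $ replicateM_ rounds $
         takeMVar ping >>= \n -> putMVar pong $! n + 1
  -- Driver thread: send the counter, wait for the reply, repeat.
  _ <- spawn 1 $ do
         let go n 0 = putMVar done n
             go n k = putMVar ping n >> takeMVar pong >>= \n' -> go n' (k - 1)
         go 0 rounds
  takeMVar done

main :: IO ()
main = do
  args <- getArgs
  -- Pass "unpinned" as the only argument to use forkIO instead of forkOn.
  let pinned = args /= ["unpinned"]
  n <- pingPong pinned 100000
  print n
```

Assuming the file is saved as pingpong.hs, it could be built with "ghc -threaded -O2 -rtsopts pingpong.hs" and then timed under the same flag combinations as above, for example "./pingpong unpinned +RTS -N4" against "./pingpong unpinned +RTS -N4 -qm".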

So the question is, what does -qm actually do that it affects this benchmark so much (~50x)? (The docs are not very clear on it.)

And furthermore, could there be a heuristic inside the runtime such that automatic thread migration is suspended if threads are "over-migrated" (which is what I suppose happens here)?

thanks for any explanations,
iustin

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (1)
-
Simon Peyton-Jones