
Hello! I tried to implement the parallel Monte-Carlo method of computing the number pi, using two cores:

--PROGRAM
module Main where

import Random
import Data.Ratio
import Data.List
import System.IO
import GHC.Conc

main = do
  putStrLn "pi 1"
  putStr "n: "
  hFlush stdout
  t <- getLine
  piMonte (read t) >>= (putStrLn . show)

piMonte n = do
  (g1, g2) <- split `fmap` getStdGen
  let r1 = r (n `div` 2) g1
      r2 = r (n `div` 2 + n `mod` 2) g2
    in return (ratio (r1 `par` (r2 `pseq` (merge r1 r2))))
  where
    r n g = (length (filter id lAll), n)
      where
        l = take n . randomRs (0, 1)
        inCircle :: Double -> Double -> Bool
        inCircle a b = a*a + b*b <= 0.25
        lAll = zipWith inCircle (l g1) (l g2)
        (g1, g2) = split g

ratio :: (Int, Int) -> Double
ratio (a, b) = fromRational (toInteger a % toInteger b * 16)

merge (a, b) (c, d) = (a + c, b + d)
--END

But it uses only one core:

C:\>ghc --make -threaded Monte.hs -fforce-recomp
[1 of 1] Compiling Main ( Monte.hs, Monte.o )
Linking Monte.exe ...

C:\>monte +RTS -N2 -s
monte +RTS -N2 -s
pi 1
n: 1000000
3.143616
   2,766,670,536 bytes allocated in the heap
       1,841,300 bytes copied during GC
           5,872 bytes maximum residency (1 sample(s))
          23,548 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  5285 collections,  5284 parallel,  0.64s,  0.31s elapsed
  Generation 1:     1 collections,     1 parallel,  0.00s,  0.00s elapsed

  Parallel GC work balance: 1.00 (454838 / 454676, ideal 2)

                      MUT time (elapsed)    GC time (elapsed)
  Task 0 (worker) :   0.00s  (  9.33s)      0.00s  (  0.00s)
  Task 1 (worker) :   0.63s  (  9.33s)      0.00s  (  0.00s)
  Task 2 (worker) :   6.00s  (  9.34s)      0.64s  (  0.31s)
  Task 3 (worker) :   0.00s  (  9.34s)      0.00s  (  0.00s)

  SPARKS: 1 (0 converted, 1 pruned)

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time    6.63s  (  9.34s elapsed)
  GC    time    0.64s  (  0.31s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.28s  (  9.66s elapsed)

  %GC time       8.8%  (3.2% elapsed)
  Alloc rate    416,628,033 bytes per MUT second
  Productivity  91.0% of total user, 68.6% of total elapsed

We see that our one spark is pruned. Why?

And another question. I compiled it also with -O:

C:\>ghc --make -threaded Monte.hs -O -fforce-recomp
[1 of 1] Compiling Main ( Monte.hs, Monte.o )
Linking Monte.exe ...

C:\>monte +RTS -N2 -s
monte +RTS -N2 -s
pi 1
n: 1000000
3.148096
   2,642,947,868 bytes allocated in the heap
       1,801,952 bytes copied during GC
           5,864 bytes maximum residency (1 sample(s))
          18,876 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  5077 collections,  5076 parallel,  0.08s,  0.05s elapsed
  Generation 1:     1 collections,     1 parallel,  0.00s,  0.00s elapsed

  Parallel GC work balance: 1.00 (445245 / 444651, ideal 2)

                      MUT time (elapsed)    GC time (elapsed)
  Task 0 (worker) :   3.94s  ( 14.02s)      0.00s  (  0.00s)
  Task 1 (worker) :   0.00s  ( 14.02s)      0.00s  (  0.00s)
  Task 2 (worker) :   5.61s  ( 14.03s)      0.08s  (  0.05s)
  Task 3 (worker) :   0.00s  ( 14.05s)      0.00s  (  0.00s)

  SPARKS: 1 (0 converted, 0 pruned)

  INIT  time    0.02s  (  0.02s elapsed)
  MUT   time    9.55s  ( 14.03s elapsed)
  GC    time    0.08s  (  0.05s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    9.64s  ( 14.09s elapsed)

  %GC time       0.8%  (0.3% elapsed)
  Alloc rate    276,386,705 bytes per MUT second
  Productivity  99.0% of total user, 67.7% of total elapsed

We see that with -O, two worker threads were doing some work, but overall performance is not better. Of the one spark, zero were converted and zero were pruned. Is it a bug?

On Wednesday, 17 March 2010, 19:49:57, Artyom Kazak wrote:
Hello! I tried to implement the parallel Monte-Carlo method of computing Pi number, using two cores: <move>
But it uses only one core:
<snip>
We see that our one spark is pruned. Why?

Well, the problem is that your tasks don't do any real work - yet. piMonte returns a thunk almost immediately; that thunk is then evaluated by show, long after your chance for parallelism is gone. You must force the work to be done _in_ r1 and r2, then you get parallelism:

  Generation 0:  2627 collections,  2626 parallel,  0.14s,  0.12s elapsed
  Generation 1:     1 collections,     1 parallel,  0.00s,  0.00s elapsed

  Parallel GC work balance: 1.79 (429262 / 240225, ideal 2)

                      MUT time (elapsed)    GC time (elapsed)
  Task 0 (worker) :   0.00s  (  8.22s)      0.00s  (  0.00s)
  Task 1 (worker) :   8.16s  (  8.22s)      0.01s  (  0.01s)
  Task 2 (worker) :   8.00s  (  8.22s)      0.13s  (  0.11s)
  Task 3 (worker) :   0.00s  (  8.22s)      0.00s  (  0.00s)

  SPARKS: 1 (1 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   16.14s  (  8.22s elapsed)
  GC    time    0.14s  (  0.12s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   16.29s  (  8.34s elapsed)

  %GC time       0.9%  (1.4% elapsed)
  Alloc rate    163,684,377 bytes per MUT second
  Productivity  99.1% of total user, 193.5% of total elapsed

But alas, it is slower than the single-threaded calculation :(

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    7.08s  (  7.10s elapsed)
  GC    time    0.08s  (  0.08s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.15s  (  7.18s elapsed)

--PROGRAM
module Main where

import Random
import Data.Ratio
import Data.List
import System.IO
import GHC.Conc

main = do
  putStrLn "pi 1"
  putStr "n: "
  hFlush stdout
  t <- getLine
  piMonte (read t) >>= (putStrLn . show)

piMonte n = do
  (g1, g2) <- split `fmap` getStdGen
  let r1 = r (n `div` 2) g1
      r2 = r (n `div` 2 + n `mod` 2) g2
    in return (ratio (r1 `par` (r2 `pseq` (merge r1 r2))))
  where
    r n g = (length (filter id lAll), n)
thunk--------^^^^^^^^^^^^^^^^^^^^^^^

That thunk doesn't take much work to produce, only to evaluate, so you must force the evaluation within r, e.g. via

    r n g = ln `pseq` (ln,n)
      where
        ln = length (filter id lAll)
        ...

Unfortunately, that doesn't give a speed-up, and I don't know why.

      where
        l = take n . randomRs (0, 1)
        inCircle :: Double -> Double -> Bool
        inCircle a b = a*a + b*b <= 0.25
        lAll = zipWith inCircle (l g1) (l g2)
        (g1, g2) = split g

ratio :: (Int, Int) -> Double
ratio (a, b) = fromRational (toInteger a % toInteger b * 16)

merge (a, b) (c, d) = (a + c, b + d)
--END
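
A minimal sketch of the idea in this reply, not Daniel's exact program: each half of the work returns a plain Int, so sparking it evaluates the whole count instead of only building a pair whose first component is still a thunk. To keep the sketch dependent on base alone, a hypothetical 64-bit linear congruential generator (next, toUnit) stands in for the Random module, and the seeds 1 and 2 are arbitrary; hits, n1, n2 and estimate are illustrative names as well. par and pseq come from GHC.Conc, as in the original post.

```haskell
module Main where

import Data.Bits (shiftR)
import Data.Word (Word64)
import GHC.Conc (par, pseq)

-- Hypothetical stand-in for the Random module: one step of a 64-bit
-- linear congruential generator (arithmetic wraps modulo 2^64).
next :: Word64 -> Word64
next s = s * 6364136223846793005 + 1442695040888963407

-- Map the top 53 bits of a generator state to a Double in [0, 1).
toUnit :: Word64 -> Double
toUnit s = fromIntegral (s `shiftR` 11) / 9007199254740992

-- Count how many of n pseudo-random points (x, y) in the unit square
-- satisfy x*x + y*y <= 0.25, the same test as inCircle in the thread.
-- The accumulator is forced at every step, so the result is a fully
-- evaluated Int rather than a chain of thunks.
hits :: Int -> Word64 -> Int
hits n seed = go n seed 0
  where
    go :: Int -> Word64 -> Int -> Int
    go k s acc
      | k <= 0    = acc
      | otherwise =
          let s1   = next s
              s2   = next s1
              x    = toUnit s1
              y    = toUnit s2
              acc' = if x * x + y * y <= 0.25 then acc + 1 else acc
          in acc' `seq` go (k - 1) s2 acc'

-- Spark the first half, evaluate the second half on the current thread,
-- and only then combine the two counts.
piMonte :: Int -> Double
piMonte n = h1 `par` (h2 `pseq` estimate)
  where
    n1 = n `div` 2
    n2 = n - n1
    h1 = hits n1 1
    h2 = hits n2 2
    estimate = 16 * fromIntegral (h1 + h2) / fromIntegral n

main :: IO ()
main = print (piMonte 1000000)
```

Built with -threaded and run with +RTS -N2 -s as in the original post, the SPARKS line should show whether the spark was converted. Whether that turns into a wall-clock speed-up depends on the hardware, as the rest of the thread shows.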

On 17/03/10 21:30, Daniel Fischer wrote:
But alas, it is slower than the single-threaded calculation :(
INIT time 0.00s ( 0.00s elapsed) MUT time 7.08s ( 7.10s elapsed) GC time 0.08s ( 0.08s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 7.15s ( 7.18s elapsed)
It works for me (GHC 6.12.1):

  SPARKS: 1 (1 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    9.05s  (  4.54s elapsed)
  GC    time    0.12s  (  0.09s elapsed)
  EXIT  time    0.00s  (  0.01s elapsed)
  Total time    9.12s  (  4.63s elapsed)

wall-clock speedup of 1.93 on 2 cores.

What hardware are you using there? Have you tried changing any GC settings?

Cheers, Simon

On Thursday, 18 March 2010, 22:44:55, Simon Marlow wrote:
It works for me (GHC 6.12.1):
SPARKS: 1 (1 converted, 0 pruned)
INIT time 0.00s ( 0.00s elapsed) MUT time 9.05s ( 4.54s elapsed) GC time 0.12s ( 0.09s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 9.12s ( 4.63s elapsed)
wall-clock speedup of 1.93 on 2 cores.
Is that Artyom's original code or with the pseq'ed length? The original didn't convert any sparks for me (~103% cpu, because of parallel GC, but the calculation always used just one thread). I'm also using 6.12.1. And, with -N2, I also have a productivity of 193.5%, but the elapsed time is larger than the elapsed time for -N1. How long does it take with -N1 for you? It's the same with 6.10.3, no converted sparks for the original code, parallelism with the pseq'ed length, but it takes longer than with -N1.
What hardware are you using there?
3.06GHz Pentium 4, 2 cores. I have mixed results with parallelism, some programmes get a speed-up of nearly a factor 2 (wall-clock time), others 1.4, 1.5 or so, yet others take about the same wall-clock time as the single threaded programme, some - like this - take longer despite using both cores intensively.
Have you tried changing any GC settings?
I've played around a little with -qg and -qb and -C, but that showed little influence. Any tips what else might be worth a try?

Daniel Fischer wrote:
3.06GHz Pentium 4, 2 cores.
Do you have more info on that? Try:

    grep 'model name' /proc/cpuinfo

The original Pentium 4 (eg "Intel(R) Pentium(R) 4 CPU 3.00GHz") had hyperthreading which was actually pretty pathetic for parallelism. The Core 2 Duos (eg "Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz") are far superior.

Erik
--
Erik de Castro Lopo
http://www.mega-nerd.com/

On Friday, 19 March 2010, 00:56:15, Erik de Castro Lopo wrote:
Daniel Fischer wrote:
3.06GHz Pentium 4, 2 cores.
Do you have more info on that? Try:
grep 'model name' /proc/cpuinfo
Well,

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Pentium(R) 4 CPU 3.06GHz
stepping        : 9
cpu MHz         : 3058.795
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pebs bts pni monitor ds_cpl tm2 cid cx16 xtpr lahf_lm
bogomips        : 6117.59
clflush size    : 64
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Pentium(R) 4 CPU 3.06GHz
stepping        : 9
cpu MHz         : 3058.795
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pebs bts pni monitor ds_cpl tm2 cid cx16 xtpr lahf_lm
bogomips        : 6118.20
clflush size    : 64
power management:

Does that mean two CPUs, each with two siblings, or what is the correct interpretation?
The original Pentium 4 (eg "Intel(R) Pentium(R) 4 CPU 3.00GHz") had hyperthreading which was actually pretty pathetic for parallelism.
The Core 2 Duos (eg "Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz") are far superior.
But probably also far more expensive :) I bought something cheap and was actually surprised when I discovered that it seemed to have two Cores/CPUs.

On Fri, 19 Mar 2010 01:22:58 +0100, Daniel Fischer wrote:
core id : 0
cpu cores : 1
It is one of those pathetic single-core Pentium 4s with so-called hyper-threading enabled. You should have checked the Intel product spreadsheet before investing in such an old CPU.

On Friday, 19 March 2010, 02:25:47, Xiao-Yong Jin wrote:
On Fri, 19 Mar 2010 01:22:58 +0100, Daniel Fischer wrote:
core id : 0 cpu cores : 1
It is one of those pathetic single core pentium4 with so called hyper-threading enabled.
'kay, but why does it say processor : 0 ... processor : 1 ?
You should have checked the intel product spreadsheet before investing such an old cpu.
It was the cheapest box in town :) And it was less old when I bought it.

Daniel Fischer wrote:
On Friday, 19 March 2010, 02:25:47, Xiao-Yong Jin wrote:
It is one of those pathetic single core pentium4 with so called hyper-threading enabled.
'kay, but why does it say
processor : 0 ... processor : 1
Hyperthreading is explained here:

http://www.pcstats.com/articleview.cfm?articleID=1302

As explained there, two hyperthreads are not equivalent to two CPU cores, because the two hyperthreads share resources while two discrete cores do not. As I remember it, the performance of the Pentium 4s with HT never lived up to the promise, and that line was swiftly replaced by the Core 2 Duo range of CPUs, which were actually quite good.

As a rough and ready test, I compiled Ben Lippmeier's DDC compiler on the following CPUs:

a) Intel(R) Pentium(R) 4 CPU 3.00GHz (2Meg cache)
b) Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz (6Meg cache)

Using ghc-6.12.1 on both (32bit Ubuntu 10.04 chroot for the P4 and a 32bit Debian unstable chroot for the Core2Duo), compiling DDC took (using 'make clean ; time make'):

a) 2m54.301s on the P4 HT
b) 0m59.277s on the Core2Duo

If nothing else, it shows that two CPUs with similar clock speeds and the same number of processors listed in /proc/cpuinfo can have vastly different performance characteristics.

Erik
--
Erik de Castro Lopo
http://www.mega-nerd.com/

On Friday, 19 March 2010, 04:24:21, Erik de Castro Lopo wrote:
Daniel Fischer wrote:
On Friday, 19 March 2010, 02:25:47, Xiao-Yong Jin wrote:
It is one of those pathetic single core pentium4 with so called hyper-threading enabled.
'kay, but why does it say
processor : 0 ... processor : 1
Hyperthreading is explained here:
Thanks. That clears things up a little.

On Mar 18, 2010, at 21:58 , Daniel Fischer wrote:
On Friday, 19 March 2010, 02:25:47, Xiao-Yong Jin wrote:
On Fri, 19 Mar 2010 01:22:58 +0100, Daniel Fischer wrote:
core id : 0 cpu cores : 1
It is one of those pathetic single core pentium4 with so called hyper-threading enabled.
'kay, but why does it say
processor : 0 ... processor : 1 ?
Because that's how Linux presents what amounts to "CPU resources", whether real (multiple cores) or virtual (HTT). You need to scan down to the core information to see if they're real or not.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
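
The "scan down to the core information" step can be sketched in Haskell as follows. This is an illustrative helper, not something posted in the thread: it counts the "processor" entries and the distinct ("physical id", "core id") pairs in /proc/cpuinfo text. The names field, blocks and cpuCounts are made up for the example, and it assumes the x86 Linux layout shown earlier in the thread; on systems whose /proc/cpuinfo lacks those two fields, every entry collapses into one "core", so the second number is only meaningful when the fields are present.

```haskell
module Main where

import Control.Exception (IOException, try)
import Data.Char (isSpace)
import Data.List (nub)
import Data.Maybe (isJust)

-- Strip leading and trailing whitespace (cpuinfo pads keys with tabs).
trim :: String -> String
trim = f . f
  where f = reverse . dropWhile isSpace

-- Split a "key : value" line into a trimmed (key, value) pair.
field :: String -> (String, String)
field line =
  let (k, v) = break (== ':') line
  in (trim k, trim (drop 1 v))

-- Group the lines into per-processor blocks separated by blank lines.
blocks :: [String] -> [[(String, String)]]
blocks ls = case break (all isSpace) ls of
  ([], []) -> []
  (b, rest) -> [map field b | not (null b)] ++ blocks (drop 1 rest)

-- (logical processors, distinct physical cores) found in cpuinfo text.
cpuCounts :: String -> (Int, Int)
cpuCounts txt = (length procs, length (nub cores))
  where
    procs = [b | b <- blocks (lines txt), isJust (lookup "processor" b)]
    cores = [(lookup "physical id" b, lookup "core id" b) | b <- procs]

main :: IO ()
main = do
  r <- try (readFile "/proc/cpuinfo") :: IO (Either IOException String)
  case r of
    Left _ -> putStrLn "could not read /proc/cpuinfo (not a Linux system?)"
    Right txt -> do
      let (logical, physical) = cpuCounts txt
      putStrLn ("logical processors: " ++ show logical)
      putStrLn ("physical cores:     " ++ show physical)
```

For the two entries Daniel posted (both with physical id 0 and core id 0), cpuCounts gives two logical processors but only one physical core, which matches the hyperthreading diagnosis.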

On Mar 18, 2010, at 21:25 , Xiao-Yong Jin wrote:
On Fri, 19 Mar 2010 01:22:58 +0100, Daniel Fischer wrote:
core id : 0 cpu cores : 1
It is one of those pathetic single core pentium4 with so called hyper-threading enabled. You should have checked the intel product spreadsheet before investing such an old cpu.
I'm a little surprised it's using both; I thought Linux (and other OSes) had disabled HTT by default because of the cache sniffing attacks.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

On 18/03/10 22:52, Daniel Fischer wrote:
Is that Artyom's original code or with the pseq'ed length?
Your fixed version.
And, with -N2, I also have a productivity of 193.5%, but the elapsed time is larger than the elapsed time for -N1. How long does it take with -N1 for you?
The 1.93 speedup was compared to the time for -N1 (8.98s in my case).
What hardware are you using there?
3.06GHz Pentium 4, 2 cores. I have mixed results with parallelism, some programmes get a speed-up of nearly a factor 2 (wall-clock time), others 1.4, 1.5 or so, yet others take about the same wall-clock time as the single threaded programme, some - like this - take longer despite using both cores intensively.
I suspect it's something specific to that processor, probably cache-related. Perhaps we've managed to put some data frequently accessed by both CPUs on the same cache line. I'd have to do some detailed profiling on that processor to find out though. If you have the time and inclination, install oprofile and look for things like "memory ordering stalls".
Have you tried changing any GC settings?
I've played around a little with -qg and -qb and -C, but that showed little influence. Any tips what else might be worth a try?
-A would be the other thing to try.
Cheers, Simon

-----Original Message-----
From: Simon Marlow
On 18/03/10 22:52, Daniel Fischer wrote:
On Thursday, 18 March 2010, 22:44:55, Simon Marlow wrote:
On 17/03/10 21:30, Daniel Fischer wrote: It works for me (GHC 6.12.1):
SPARKS: 1 (1 converted, 0 pruned)
INIT time 0.00s ( 0.00s elapsed) MUT time 9.05s ( 4.54s elapsed) GC time 0.12s ( 0.09s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 9.12s ( 4.63s elapsed)
wall-clock speedup of 1.93 on 2 cores.
Is that Artyom's original code or with the pseq'ed length?
Your fixed version.
Good. So I can at least continue to believe I have a rough idea of how GHC behaves.
And, with -N2, I also have a productivity of 193.5%, but the elapsed time is larger than the elapsed time for -N1. How long does it take with -N1 for you?
The 1.93 speedup was compared to the time for -N1 (8.98s in my case).
What hardware are you using there?
3.06GHz Pentium 4, 2 cores. I have mixed results with parallelism, some programmes get a speed-up of nearly a factor 2 (wall-clock time), others 1.4, 1.5 or so, yet others take about the same wall-clock time as the single threaded programme, some - like this - take longer despite using both cores intensively.
I suspect it's something specific to that processor, probably cache-related. Perhaps we've managed to put some data frequently accessed by both CPUs on the same cache line. I'd have to do some detailed profiling on that processor to find out though. If you have the time and inclination, install oprofile and look for things like "memory ordering stalls".
It seems that I've just been fooled by /proc/cpuinfo listing it as two and having something like 190% cpu usage in top/time. Being oblivious of almost everything hardware-related, I naively took it at face value. In fact it's probably just one hyperthreaded CPU, so since the two threads here do exactly the same type of work, it's natural then that it doesn't give a speed-up.
Have you tried changing any GC settings?
I've played around a little with -qg and -qb and -C, but that showed little influence. Any tips what else might be worth a try?
-A would be the other thing to try.
Cheers, Simon

Daniel Fischer writes:
3.06GHz Pentium 4, 2 cores.
[I.e. a single-core hyperthreaded CPU]
I have mixed results with parallelism, some programmes get a speed-up of nearly a factor 2 (wall-clock time), others 1.4, 1.5 or so, yet others take about the same wall-clock time as the single threaded programme, some - like this - take longer despite using both cores intensively.
Given the negative press around HT, I'm surprised you see such good results on many programs. I thought the main benefit from Intel's HT was to reduce the impact of memory latency, that is, when one thread was blocking on memory, it could switch immediately to another, ready-to-run, thread. (I may be misunderstanding this, though). I think the general consensus was a 10-15% speedup from HT.

Anyway, the thing to get these days is of course Nehalem, A.K.A. Core i{3,5,7}, which seems to give a nice speedup over Core 2. Among other things, it dynamically overclocks the busy cores (using the more market-friendly term "turbo mode"), making it even harder to compare performance reliably. Interesting times.

-k
--
If I haven't seen further, it is by standing in the footprints of giants

On 19/03/10 09:00, Ketil Malde wrote:
Daniel Fischer writes:
3.06GHz Pentium 4, 2 cores.
[I.e. a single-core hyperthreaded CPU]
Ah, that would definitely explain a lack of parallelism. I'm just grateful we don't have another one of those multicore cache-line performance bugs, because they're a nightmare to track down.

Cheers, Simon

participants (8)
- Artyom Kazak
- Brandon S. Allbery KF8NH
- Colin Paul Adams
- Daniel Fischer
- Erik de Castro Lopo
- Ketil Malde
- Simon Marlow
- Xiao-Yong Jin