How to read the CPU time vs. wall time report from GHC?

Hi guys,

I'm new to parallel programming with Haskell. I made a simple test program using the par combinator etc., and was a bit unhappy that it turns out to be slower than its sequential version. But first, I don't fully understand how to read the runtime report produced by GHC with the -s option:

  SPARKS: 5 (5 converted, 0 pruned)

  INIT  time    0.02s  ( 0.01s elapsed)
  MUT   time    3.46s  ( 0.89s elapsed)
  GC    time    5.49s  ( 1.46s elapsed)
  EXIT  time    0.00s  ( 0.00s elapsed)
  Total time    8.97s  ( 2.36s elapsed)

As I understand it from the documentation, the left time column is the CPU time, whereas the right one is the elapsed wall time. But how come the wall time is less than the CPU time? Isn't wall time the user's perspective of time, i.e. CPU time + IO + etc.?

Any help?

--Wish.
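For context, a minimal sketch of the kind of par-based test program being discussed (purely illustrative; the actual code was not posted, and the names and numbers here are made up). It needs the parallel package, and must be compiled with -threaded and run with +RTS -N -s to produce a report like the one above:

  import Control.Parallel (par, pseq)

  -- A deliberately expensive pure function, standing in for the real work.
  fib :: Integer -> Integer
  fib n
    | n < 2     = n
    | otherwise = fib (n - 1) + fib (n - 2)

  -- 'par' sparks the evaluation of 'a' on another capability, while 'pseq'
  -- forces 'b' on the current one before the two results are combined.
  parWork :: Integer -> Integer
  parWork n = a `par` (b `pseq` (a + b))
    where
      a = fib n
      b = fib (n + 1)

  main :: IO ()
  main = print (parWork 33)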

Hello Wishnu,

That is slightly odd. What CPU and operating system are you running on? Include the kernel version if Linux.

Cheers,
Edward

Excerpts from Wishnu Prasetya's message of Sun Aug 14 14:11:36 -0400 2011:
> But how come that the wall time is less than the CPU time? Isn't wall time the user's perspective of time, i.e. CPU time + IO + etc.?

Hello Edward,

I'm using Windows 7 on an Intel i7 (4 cores with hyperthreading)...

--Wish.

On Sun, Aug 14, 2011 at 08:11:36PM +0200, Wishnu Prasetya wrote:
> As I understand it from the documentation, the left time column is the CPU time, whereas the right one is the elapsed wall time. But how come that the wall time is less than the CPU time? Isn't wall time the user's perspective of time, i.e. CPU time + IO + etc.?
Yes, but if you have multiple CPUs, then CPU time "accumulates" faster than wall-clock time. Based on the above example, I guess you have, or run the program on, 4 cores (2.36 * 4 = 9.44, which means you got a very nice ~95% efficiency).

regards,
iustin
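To spell out that arithmetic (the numbers are taken from the report quoted above; just a back-of-the-envelope check in GHCi):

  ghci> 2.36 * 4           -- CPU-seconds available in 2.36s of wall time on 4 cores, = 9.44
  ghci> 8.97 / (2.36 * 4)  -- fraction of that capacity actually used, roughly 0.95 (~95%)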

Ah, good catch. :-)

Edward

Excerpts from Iustin Pop's message of Sun Aug 14 14:25:02 -0400 2011:
> Yes, but if you have multiple CPUs, then CPU time "accumulates" faster than wall-clock time.
> Based on the above example, I guess you have, or run the program on, 4 cores (2.36 * 4 = 9.44, which means you got a very nice ~95% efficiency).

On 14-8-2011 20:25, Iustin Pop wrote:
> Yes, but if you have multiple CPUs, then CPU time "accumulates" faster than wall-clock time. Based on the above example, I guess you have, or run the program on, 4 cores (2.36 * 4 = 9.44, which means you got a very nice ~95% efficiency).

That makes sense... But are you sure that's how I should read this? I don't want to get too happy too early.
--Wish.

On Sun, Aug 14, 2011 at 08:32:36PM +0200, Wishnu Prasetya wrote:
> That makes sense... But are you sure that's how I should read this?
As far as I know, this is correct.
> I don't want to get too happy too early.
Well, your algorithm does work in parallel, but if you look at the GC/MUT time, ~60% of the total runtime is spent in GC, so you have a space leak or an otherwise inefficient algorithm. The final speedup is just 3.46s/2.36s, i.e. 1.46x instead of ~4x, so you still have some work to do to make this better.

At least, this is how I read those numbers.

regards,
iustin
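One way to arrive at the fractions quoted here from the report above (rough arithmetic only):

  ghci> 5.49 / 8.97   -- GC CPU time over total CPU time, roughly 0.61 (the "~60% in GC")
  ghci> 3.46 / 2.36   -- MUT CPU time over total elapsed time, roughly 1.46 (the speedup figure)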

On Sunday 14 August 2011, 21:53:21, Iustin Pop wrote:
> That makes sense... But are you sure that's how I should read this?
> As far as I know, this is correct.
It is indeed. CPU time is the sum of CPU time for all threads, which is typically larger than elapsed time when several threads run in parallel.
> I don't want to get too happy too early.
> Well, your algorithm does work in parallel, but if you look at the GC/MUT time, ~60% of the total runtime is spent in GC, so you have a space leak or an otherwise inefficient algorithm.
Not enough data to make more than guesses concerning the cause, but 60% GC definitely indicates a problem with the algorithm (resp. its implementation).
> The final speedup is just 3.46s/2.36s, i.e. 1.46x instead of ~4x, so you still have some work to do to make this better.
We don't know the times for a non-threaded run (or an -N1 run), so it could be anything from a slowdown to a > 4× speedup (but it's likely to be a speedup by a factor < 4×).

On 14-8-2011 22:17, Daniel Fischer wrote:
> We don't know the times for a non-threaded run (or an -N1 run), so it could be anything from a slowdown to a > 4× speedup (but it's likely to be a speedup by a factor < 4×).
Well, the -N1 run is below. The sequential version of the program has almost the same profile:

  SPARKS: 5 (1 converted, 4 pruned)

  INIT  time    0.00s  ( 0.00s elapsed)
  MUT   time    2.78s  ( 2.99s elapsed)
  GC    time    4.35s  ( 4.15s elapsed)
  EXIT  time    0.00s  ( 0.00s elapsed)
  Total time    7.13s  ( 7.14s elapsed)

Am I correct then to say that the speedup with respect to the sequential version is equal to the total elapsed time at -N1 divided by the total elapsed time at -N4? So I have 7.14 / 2.36 = 3.0 speedup, and not 1.46 as Iustin said?

I'll probably have to do something with that GC :)

--Wish.
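Checking that division against the elapsed-time column of the two reports (arithmetic only):

  ghci> 7.14 / 2.36   -- total elapsed at -N1 over total elapsed at -N4, roughly 3.0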

On Sunday 14 August 2011, 22:42:13, Wishnu Prasetya wrote:
> Am I correct then to say that the speedup with respect to the sequential version is equal to the total elapsed time at -N1 divided by the total elapsed time at -N4? So I have 7.14 / 2.36 = 3.0 speedup, and not 1.46 as Iustin said?
Yes (with respect to wall-clock time of course). That's not too bad, though it should be possible to increase that factor.
> I'll probably have to do something with that GC :)
But that should be the first target; there's probably some low-hanging fruit there. Maybe increasing the size of the allocation area (+RTS -Ax) or the heap (+RTS -Hx) would already do some good. Also do heap profiling to find out what most likely takes so much GC time (before compiling for profiling, running with +RTS -hT could produce useful information).
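A concrete way to try those suggestions from the command line (the program name and the particular -A/-H values are illustrative, not taken from the thread; on Windows the executable would be invoked as MyProg.exe):

  $ ghc -O2 -threaded -rtsopts MyProg.hs
  $ ./MyProg +RTS -N4 -A64m -s     # larger allocation area
  $ ./MyProg +RTS -N4 -H256m -s    # suggested overall heap size
  $ ./MyProg +RTS -N4 -hT -s       # heap profile by closure type, no profiling build needed
  $ hp2ps -c MyProg.hp             # render the resulting heap profile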

On 14-8-2011 23:05, Daniel Fischer wrote:
> Maybe increasing the size of the allocation area (+RTS -Ax) or the heap (+RTS -Hx) would already do some good. Also do heap profiling to find out what most likely takes so much GC time.

Ah ok. Thanks a lot for all the answers and tips.
--Wish.
participants (4)

- Daniel Fischer
- Edward Z. Yang
- Iustin Pop
- Wishnu Prasetya