
Hi, all!

I'm seeing strange GHC behavior. Consider the code:

    import Control.Parallel

    main = print (o `par` (fromInteger e) / (fromInteger o))
      where
        [e,o] = map sum $ map (`filter` numbers) [even, odd]
        numbers = [1..10000000]

When it is compiled without -threaded, it takes 19068 ms to run, with 396 Mb total memory in use and 88.2% GC time; the same happens with -threaded and +RTS -N1. But with +RTS -N2 it takes only 3806 ms to run, with 3 Mb total memory in use and 8.1% GC time. Why is that? Is it a bug, or have I missed something?

I tested it on a dual-core Athlon X2 4200+ with 2 GB of RAM, running a 64-bit Gentoo system, with gcc 4.2.2 and ghc 6.8.2.

--
Ruslan

On 16 Feb 2008, at 3:06 PM, Ruslan Evdokimov wrote:
> Hi, all!
> I have strange GHC behavior. Consider the code:
>
>     import Control.Parallel
>
>     main = print (o `par` (fromInteger e) / (fromInteger o))
>       where
>         [e,o] = map sum $ map (`filter` numbers) [even, odd]
>         numbers = [1..10000000]
>
> When it is compiled without -threaded, it takes 19068 ms to run, with 396 Mb total memory in use and 88.2% GC time; the same happens with -threaded and +RTS -N1. But with +RTS -N2 it takes only 3806 ms to run, with 3 Mb total memory in use and 8.1% GC time. Why is that? Is it a bug, or have I missed something?
Wild guess? If you leave o as a thunk, to be evaluated once the program has e, then that thunk still refers to numbers, so you keep the entire 10-million-entry list in memory. Evaluating e and o in parallel allows the system to start garbage collecting cons cells from numbers much earlier, which reduces residency (I'd've been unsurprised at more than two orders of magnitude). Managing the smaller heap (and especially not having to copy numbers on each GC) then makes the garbage collector go much faster, so you get a smaller run time.
> I tested it on a dual-core Athlon X2 4200+ with 2 GB of RAM, running a 64-bit Gentoo system, with gcc 4.2.2 and ghc 6.8.2.
jcc
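
A minimal sketch of the residency argument above, not code from the thread: if both sums are computed in a single strict pass, nothing else holds on to numbers, so each cons cell should become garbage as soon as it has been consumed, with or without -N2. The helper name sumEvenOdd is hypothetical, and the sketch uses only foldl' from Data.List.

```haskell
import Data.List (foldl')

-- Hypothetical helper: sum the even and the odd elements in one pass.
-- foldl' forces the pair at every step, and seq forces the component
-- that was just updated, so no chain of unevaluated additions builds up.
sumEvenOdd :: [Integer] -> (Integer, Integer)
sumEvenOdd = foldl' step (0, 0)
  where
    step (e, o) n
      | even n    = let e' = e + n in e' `seq` (e', o)
      | otherwise = let o' = o + n in o' `seq` (e, o')

main :: IO ()
main = do
  -- The list is consumed exactly once and is not shared with a second
  -- traversal, so it does not have to stay in memory as a whole.
  let (e, o) = sumEvenOdd [1 .. 10000000]
  print (fromInteger e / fromInteger o :: Double)
```

This trades the parallelism of the original program for a single sequential traversal; it is only meant to show that the memory behaviour comes from sharing numbers between two traversals.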

2008/2/17, Jonathan Cast
> Wild guess? If you leave o as a thunk, to be evaluated once the program has e, then it has numbers, so you keep the entire 10-million entry list in memory. Evaluating e and o in parallel allows the system to start garbage collecting cons cells from numbers much earlier, which reduces residency (I'd've been unsurprised at more than two orders of magnitude). Managing the smaller heap (and especially not having to copy numbers on each GC) then makes the garbage collector go much faster, so you get a smaller run time.
But I also tested it on a P-IV 3.0 with HT and 1 GB of RAM (single core) running Windows XP (ghc 6.8.2), and it works fine (fast and with low GC time) in all three cases, without significant difference. Of course, it didn't run faster with -N2, because it's not dual-core.

ruslan.evdokimov:
> But I also tested it on a P-IV 3.0 with HT and 1 GB of RAM (single core) running Windows XP (ghc 6.8.2), and it works fine (fast and with low GC time) in all three cases, without significant difference. Of course, it didn't run faster with -N2, because it's not dual-core.
What flags did you compile the code with?

On Sun, Feb 17, 2008 at 03:07:15AM +0300, Ruslan Evdokimov wrote:
> But I also tested it on a P-IV 3.0 with HT and 1 GB of RAM (single core) running Windows XP (ghc 6.8.2), and it works fine (fast and with low GC time) in all three cases, without significant difference. Of course, it didn't run faster with -N2, because it's not dual-core.
This makes perfect sense: -N2 tells GHC to use two threads, and if you run two threads on a single-processor system, the OS runs them alternately (switching around 100 times per second on modern Linux, and probably similarly on other systems). Thus the two evaluations never get more than a hundredth of a second out of step, and memory usage stays low.
Stefan
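
How far the two evaluations drift apart depends on when the spark for o actually gets run. The following is a minimal sketch, not the poster's code, that makes the intended order explicit: spark o, force e in the current thread, then divide. It assumes par, pseq and numCapabilities as exported by GHC.Conc in base (the program in the thread imports par from Control.Parallel instead); evenOddRatio is a hypothetical name.

```haskell
import GHC.Conc (numCapabilities, par, pseq)

-- Hypothetical helper: ratio of the sum of the even elements to the sum
-- of the odd elements. o is sparked for possible parallel evaluation,
-- e is forced in the current thread, and only then is the division done.
evenOddRatio :: [Integer] -> Double
evenOddRatio numbers = o `par` (e `pseq` (fromInteger e / fromInteger o))
  where
    e = sum (filter even numbers)
    o = sum (filter odd numbers)

main :: IO ()
main = do
  -- Reports how many capabilities the runtime was started with (+RTS -N).
  putStrLn ("capabilities: " ++ show numCapabilities)
  print (evenOddRatio [1 .. 10000000])
```

As in the thread, the spark can only run in parallel when the program is built with -threaded and started with +RTS -N2; otherwise o is evaluated by the main thread after e, while numbers is still referenced.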

2008/2/17, Stefan O'Rear
> This makes perfect sense: -N2 tells GHC to use two threads, and if you run two threads on a single-processor system, the OS runs them alternately (switching around 100 times per second on modern Linux, and probably similarly on other systems). Thus the two evaluations never get more than a hundredth of a second out of step, and memory usage stays low.
> Stefan
Test on Windows XP, Athlon X2 4200+, 2 GB:
C:\imp>test 1 12328 ms
C:\imp>test +RTS -N2 1 5234 ms
C:\imp>test +RTS -N2 1 3515 ms
1st - 1 thread
2nd - 2 threads on a single core (one core disabled through Task Manager)
3rd - 2 threads on different cores

On Sun, Feb 17, 2008 at 03:41:52AM +0300, Ruslan Evdokimov wrote:
> Test on Windows XP, Athlon X2 4200+, 2 GB:
> C:\imp>test 1 12328 ms
> C:\imp>test +RTS -N2 1 5234 ms
> C:\imp>test +RTS -N2 1 3515 ms
> 1st - 1 thread
> 2nd - 2 threads on a single core (one core disabled through Task Manager)
> 3rd - 2 threads on different cores
As far as I can tell, that confirms my explanation. If you see it differently, say how.
Stefan

2008/2/17, Stefan O'Rear
> As far as I can tell, that confirms my explanation. If you see it differently, say how.
> Stefan
It seems you're right. I changed it to:

    [e,o] = map sum $ [filter even numbers, (filter odd) $ reverse numbers]

This prevents numbers from being collected, and here are the results:
test.exe 1 12812 ms
test.exe +RTS -N2 1 16671 ms
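
A small sketch of why the reverse variant keeps the list alive, assuming nothing beyond the Prelude: filter can hand out results while its input is still being produced, whereas reverse cannot return its first element until it has walked its whole input.

```haskell
main :: IO ()
main = do
  -- filter consumes its input lazily, so this terminates even though
  -- the input list is infinite.
  print (take 3 (filter even [1 ..]))
  -- reverse has to reach the end of its input before it can yield
  -- anything, so the complete list exists in memory at that point.
  print (take 3 (filter odd (reverse [1 .. 10])))
```

With ten million elements instead of ten, the second pattern means the odd branch needs the full list at once, which is consistent with the slower timings reported above.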
participants (4)
- Don Stewart
- Jonathan Cast
- Ruslan Evdokimov
- Stefan O'Rear