
akborder:
The threaded version running on 2 cores is moderately faster than the serial one:
    $ ./Parser +RTS -s -N2
       2,377,165,256 bytes allocated in the heap
          36,320,944 bytes copied during GC
           6,020,720 bytes maximum residency (6 sample(s))
           6,933,928 bytes maximum slop
                  21 MB total memory in use (0 MB lost due to fragmentation)

      Generation 0:  2410 collections,     0 parallel,  0.33s,  0.34s elapsed
      Generation 1:     6 collections,     4 parallel,  0.06s,  0.05s elapsed

      Parallel GC work balance: 1.83 (2314641 / 1265968, ideal 2)

      Task  0 (worker) :  MUT time:   2.43s  (  1.19s elapsed)
                          GC  time:   0.02s  (  0.02s elapsed)

      Task  1 (worker) :  MUT time:   2.15s  (  1.19s elapsed)
                          GC  time:   0.29s  (  0.30s elapsed)

      Task  2 (worker) :  MUT time:   2.37s  (  1.19s elapsed)
                          GC  time:   0.07s  (  0.08s elapsed)

      Task  3 (worker) :  MUT time:   2.45s  (  1.19s elapsed)
                          GC  time:   0.00s  (  0.00s elapsed)

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time    2.06s  (  1.19s elapsed)
      GC    time    0.39s  (  0.39s elapsed)
      EXIT  time    0.00s  (  0.00s elapsed)
      Total time    2.45s  (  1.58s elapsed)

      %GC time      15.7%  (24.9% elapsed)

      Alloc rate    1,151,990,234 bytes per MUT second

      Productivity  84.2% of total user, 130.2% of total elapsed
The speedup is smaller than I expected, given that each unit of work (250 input lines) is completely independent of the others. Changing the size of each work unit did not help, and since garbage collection times are already small, increasing the minimum heap size did not produce any speedup either.
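The original post does not include the parsing code, but a chunked parallel map of the kind described (independent work units of 250 lines each) is typically written with parListChunk from Control.Parallel.Strategies. The sketch below uses a trivial stand-in parser (parseLine = length) purely for illustration; the chunk size of 250 mirrors the work-unit size mentioned above.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Stand-in for the real line parser (hypothetical; the actual
-- parser is not shown in the post).
parseLine :: String -> Int
parseLine = length

-- Map the parser over all input lines, sparking one unit of
-- work per 250-line chunk and forcing each result fully.
parseAll :: [String] -> [Int]
parseAll ls = map parseLine ls `using` parListChunk 250 rdeepseq

main :: IO ()
main = print (sum (parseAll (replicate 1000 "sample input line")))
```

With a structure like this, too small a chunk wastes time on spark overhead while too large a chunk limits load balancing, which is why varying the work-unit size is a natural first experiment.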
Is there anything else I can do to understand why the parallel map does not provide a significant speedup?
Very interesting idea! I think the big thing would be to measure it with GHC HEAD so you can see how effectively the sparks are being converted into threads. Is there a package and test case somewhere we can try out? -- Don