
On Sat, Mar 10, 2012 at 4:21 PM, Thiago Negri
c:\tmp\hs>par +RTS -s -N1
par +RTS -s -N1
20000000
  803,186,152 bytes allocated in the heap
  859,916,960 bytes copied during GC
  233,465,740 bytes maximum residency (10 sample(s))
   30,065,860 bytes maximum slop
          483 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 1523 collections, 0 parallel, 0.80s, 0.75s elapsed
  Generation 1: 10 collections, 0 parallel, 0.83s, 0.99s elapsed
Parallel GC work balance: nan (0 / 0, ideal 1)
c:\tmp\hs>par +RTS -s -N2
par +RTS -s -N2
20000000
  1,606,279,644 bytes allocated in the heap
         74,924 bytes copied during GC
         28,340 bytes maximum residency (1 sample(s))
         29,004 bytes maximum slop
              2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 1566 collections, 1565 parallel, 0.00s, 0.01s elapsed
  Generation 1: 1 collections, 1 parallel, 0.00s, 0.00s elapsed
Parallel GC work balance: 1.78 (15495 / 8703, ideal 2)
An important part of what happened is explained by this:

-N1
483 MB total memory in use (0 MB lost due to fragmentation)
-N2
2 MB total memory in use (0 MB lost due to fragmentation)
The thing is, in the first version the whole list had to be resident in memory: there were two traversals, so the head of the list was retained during the first traversal so that the second one could work on the same list. In the version where both traversals ran in parallel, the list was produced and consumed in constant memory, since both folds could progress simultaneously and the cells both had already passed could be collected right away.

Memory use was therefore far smaller (about 28 KB maximum residency instead of about 233 MB), which must explain in part why the collections were so much faster (apparently there was still 0.01s elapsed for the generation 0 collections).

-- Jedaï
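To make the retention argument concrete, here is a minimal sketch of the two shapes of program being compared. It is not the original code from the thread: the names sequentialFolds and parallelFolds, the use of foldl' as the traversal, and the list size are all assumptions for illustration. It takes par and pseq from GHC.Conc in base so that it needs no extra package.

```haskell
module Main where

import Data.List (foldl')
import GHC.Conc (par, pseq)

-- Hypothetical helper: a strict fold that just counts the elements.
count :: [Int] -> Int
count = foldl' (\acc _ -> acc + 1) 0

-- Two traversals, one after the other. While `a` is being computed,
-- `b` still refers to the head of `xs`, so every cell produced during
-- the first traversal stays live until the second one has consumed it.
sequentialFolds :: Int -> Int
sequentialFolds n =
  let xs = [1 .. n]
      a  = count xs
      b  = count xs
  in a `pseq` (b `pseq` (a + b))

-- The same two traversals, with `a` sparked for parallel evaluation.
-- If both folds advance at a similar pace, the cells behind them are
-- no longer referenced by either and can be collected early.
parallelFolds :: Int -> Int
parallelFolds n =
  let xs = [1 .. n]
      a  = count xs
      b  = count xs
  in a `par` (b `pseq` (a + b))

main :: IO ()
main = do
  let n = 1000000 -- assumed size, chosen only to keep the sketch light
  print (sequentialFolds n)
  print (parallelFolds n)
```

Building this with something like `ghc -O2 -threaded -rtsopts` and running it with `+RTS -s -N1` and then `+RTS -s -N2` should let you compare the maximum residency figures. The parallel version only stays in constant memory as long as the spark really runs on another core and neither fold gets far ahead of the other.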