
2008/2/17, Jonathan Cast
Wild guess? If you leave o as a thunk, to be evaluated once the program has e, then that thunk still refers to numbers, so the entire 10-million-entry list stays in memory. Evaluating e and o in parallel lets the system start garbage collecting cons cells from numbers much earlier, which reduces residency (I'd've been unsurprised at more than two orders of magnitude). Managing the smaller heap (and especially not having to copy numbers on each GC) then makes the garbage collector go much faster, so you get a shorter run time.
But I also tested it on a Pentium 4 3.0 GHz with HT and 1 GB of RAM (single core) running Windows XP (ghc 6.8.2), and it works fine (fast, with low GC overhead) in all three cases, without significant difference. Of course it didn't run any faster with -N2, because it's not dual-core.
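
For reference, here is a minimal sketch of the kind of program being discussed. The original source is not reproduced in this message, so the function name sumEvenOdd and the exact definitions of e and o are assumptions for illustration; only the names numbers, e and o and the 10-million-entry list come from the thread. It uses par and pseq from GHC.Conc (Control.Parallel re-exports the same functions) to spark o while the main thread evaluates e, so neither sum is left as a thunk holding on to numbers.

```haskell
import Data.List (foldl')
import GHC.Conc (par, pseq)

-- Hypothetical reconstruction: sum the even and the odd elements of
-- one shared list. The name sumEvenOdd is an assumption, not taken
-- from the thread.
sumEvenOdd :: Int -> (Int, Int)
sumEvenOdd n = o `par` (e `pseq` (e, o))
  where
    numbers = [1 .. n]                        -- shared by e and o
    e = foldl' (+) 0 (filter even numbers)    -- evaluated by the main thread
    o = foldl' (+) 0 (filter odd numbers)     -- sparked for parallel evaluation

main :: IO ()
main = print (sumEvenOdd 10000000)
```

To compare the cases, one could build with ghc -O2 -threaded and run with +RTS -N2 -s, then again with +RTS -N1 -s; the -s report shows the maximum residency and the time spent in GC. Replacing the body of sumEvenOdd with plain (e, o) gives the sequential variant, in which o stays unevaluated until after e is finished.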