 
            I'm slightly surprised by this - in my experience parallel builds beat --make as long as the parallelism is a factor of 2 or more. Is your dependency graph very narrow, or do you have lots of very small modules?
From scratch, --make (that's what 'make -j3' winds up calling) wins slightly. --make loses handily at detecting that nothing needs to be done :) And as expected, modifying one file is all about the linking, though it's odd how --make was faster.
I get full parallelism: 4 threads at once on a 2-core i5 with 2 hyperthreads (or whatever) per core, and an SSD. Maybe I should try with just 2 threads. I only ever get 200% CPU at most, so it seems the hyperthreads are not really much like a whole core. The modules are usually around 150-250 lines. Here are the timings for an older run:
from scratch (191 modules):
    runghc Shake/Shakefile.hs build/debug/seq  128.43s user 20.04s system 178% cpu 1:23.01 total
    no link: runghc Shake/Shakefile.hs build/debug/seq  118.92s user 19.21s system 249% cpu 55.383 total
    make -j3 build/seq  68.81s user 9.98s system 98% cpu 1:19.60 total
modify nothing:
    runghc Shake/Shakefile.hs build/debug/seq  0.65s user 0.10s system 96% cpu 0.780 total
    make -j3 build/seq  6.05s user 1.21s system 85% cpu 8.492 total
modify one file:
    runghc Shake/Shakefile.hs build/debug/seq  19.50s user 2.37s system 94% cpu 23.166 total
    make -j3 build/seq  12.81s user 1.85s system 94% cpu 15.586 total
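As a cross-check on the timings above: the "cpu" percentage printed by the shell's time is (user + system) divided by wall-clock time, so it reads as the average number of busy cores. Here is a small sketch that redoes that arithmetic for the three from-scratch runs. The function names wallSeconds and busyCores are made up for illustration, and the parser only handles the "m:ss.ss" and "ss.sss" shapes that appear in this thread.

```haskell
-- Wall-clock seconds from a "m:ss.ss" or "ss.sss" total, as printed above.
wallSeconds :: String -> Double
wallSeconds s = case break (== ':') s of
  (secs, "")       -> read secs
  (mins, _ : secs) -> 60 * read mins + read secs

-- Average number of busy cores: (user + system) / wall clock.
busyCores :: Double -> Double -> String -> Double
busyCores user sys total = (user + sys) / wallSeconds total

main :: IO ()
main = mapM_ report
  [ ("shake, from scratch",    128.43, 20.04, "1:23.01")
  , ("shake, no link",         118.92, 19.21, "55.383")
  , ("make -j3, from scratch",  68.81,  9.98, "1:19.60")
  ]
  where
    report (label, u, s, t) =
      putStrLn (label ++ ": " ++ show (busyCores u s t))
```

This comes out at roughly 1.79, 2.49 and 0.99 busy cores, matching the 178%, 249% and 98% in the timings. Read that way, the make -j3 run kept only about one core busy, which would be consistent with it ending up in a single --make.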
I like the idea! And it should be possible to build this without modifying GHC at all, on top of the GHC API. As you say, you'll need a server process, which accepts command lines, executes them, and sends back the results. A local socket should be fine (and will work on both Unix and Windows).
The server process can either do the compilation itself, or have several workers. Unfortunately the workers would have to be separate processes, because the GHC API is single-threaded.
When a worker gets too large, just kill it and start a new one.
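To make the "kill it when it gets too large" policy concrete, here is a minimal pure sketch of the bookkeeping the server would do. It is only a model under stated assumptions: there are no real processes or sockets here, and Worker, Job, limitMB and the per-job growth figures are all hypothetical. A real server would measure the worker's actual memory use and restart an actual process.

```haskell
import Data.List (mapAccumL)

-- Hypothetical model of one worker process.
data Worker = Worker
  { workerId  :: Int  -- bumped each time the worker is replaced
  , residency :: Int  -- estimated memory in MB (assumed to be measurable)
  } deriving (Show, Eq)

-- A compile request: the command line plus its assumed memory growth.
data Job = Job { cmdLine :: String, growthMB :: Int }

-- Arbitrary threshold, chosen only for illustration.
limitMB :: Int
limitMB = 500

-- Run one job on the worker. If the worker ends up over the limit,
-- "kill" it and hand back a fresh one with a new id and no residency.
runJob :: Worker -> Job -> (Worker, String)
runJob w job
  | grown > limitMB = (Worker (workerId w + 1) 0, reply ++ " (worker restarted)")
  | otherwise       = (w { residency = grown }, reply)
  where
    grown = residency w + growthMB job
    reply = "worker " ++ show (workerId w) ++ " ran: " ++ cmdLine job

-- Feed a queue of jobs through one worker slot, collecting the replies.
serve :: Worker -> [Job] -> (Worker, [String])
serve = mapAccumL runJob

main :: IO ()
main = do
  let jobs = [Job "ghc -c A.hs" 200, Job "ghc -c B.hs" 200, Job "ghc -c C.hs" 200]
      (final, replies) = serve (Worker 0 0) jobs
  mapM_ putStrLn replies
  print final
```

With these made-up numbers the third job pushes worker 0 past the limit, so it is replaced by worker 1 after replying. Several workers would just be several such slots, each fed from the server's queue.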
One benefit of real processes: I'm pretty confident all the memory will be GCed after the whole process is killed :) I'll start looking into the GHC API. I have no experience with it, but I assume I can look at what GhcMake.hs is doing and learn from that.