
I recently switched from ghc --make to a parallelized build system. I was looking forward to faster builds, and while the new system is much faster at figuring out what has to be rebuilt (which is most of the cost of a small rebuild, since ld dominates), compiling the whole system is either the same speed or slightly slower than the single-threaded ghc --make version. My guess is that the overhead of starting up lots of individual ghcs, each of which has to read all the .hi files over again, just about cancels out the parallelism gains.

One way around that would be to parallelize --make, which has been a TODO for a long time. However, I believe that's never going to be satisfactory for a project involving several different languages, because ghc itself is never going to be a general-purpose build system. ghc --make really provides two things: a dependency chaser and a way to keep the compiler resident as it compiles new files. Since the dependency chaser will never be as powerful as a real build system, it occurs to me that the only reasonable way forward is to split out the second part, by adding an --interactive flag to ghc. It would then read filenames on stdin, compiling each one in turn, and exit only when it sees EOF. A separate program, ghc-fe, could then wrap ghc and act as a drop-in replacement for it.

It would be nice if ghc could atomically read one line from its input; then you could just start a bunch of ghcs behind a named pipe and each would steal its own work. But I don't think that's possible with unix pipes, and of course there are still a few non-unix systems out there. Also, ghc-fe has to wait for the compilation to finish, so ghc would have to print a status line when it completes (or fails) a module.
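A minimal sketch of what that stdin loop could look like. Note that compileFile is a made-up stand-in for whatever GHC API call would actually do the work, and the status-line format is invented for illustration:

```haskell
-- Sketch of the proposed mode: read one filename per line from stdin,
-- compile it with the resident session, print a status line so a
-- wrapper knows when the module is done, and exit at EOF.
import System.IO (isEOF, hFlush, stdout)

-- Hypothetical stand-in for a real GHC API call that compiles a file
-- while keeping the loaded .hi caches warm between requests.
compileFile :: FilePath -> IO Bool
compileFile _ = return True

-- Invented status-line format: "OK Foo.hs" or "FAIL Foo.hs".
statusLine :: Bool -> FilePath -> String
statusLine ok src = (if ok then "OK " else "FAIL ") ++ src

main :: IO ()
main = do
    eof <- isEOF
    if eof
        then return ()
        else do
            src <- getLine
            ok <- compileFile src
            putStrLn (statusLine ok src)
            hFlush stdout  -- the wrapper is blocked on this line
            main
```

The flush matters: whatever ghc-fe turns out to be, it is blocked waiting on that status line, so it must not sit in a buffer.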
But it can still be done with an external distributor program that acts like a server: it starts up n ghcs, distributes source files between them, and shuts them down when given the command. In pseudocode (startup, while, tellGhc, and the rest are sketched helpers):

    data Status = Free | Busy
    data Ghc = Ghc { status :: Status, inH :: Handle, outH :: Handle, pid :: Int }

    main = do
        origFlags <- getArgs
        ghcs <- mapM (startup origFlags) [0..cpus]
        socket <- accept
        while $ read socket >>= \case
            Quit -> return False
            Compile ghcFlags src -> do
                forkIO $ do
                    assert (ghcFlags == origFlags)
                    result <- bracket (findFreeAndMarkBusy ghcs) markFree $ \ghc -> do
                        tellGhc ghc src
                        readResult ghc
                    write socket result
                return True
        mapM_ shutdown ghcs

The ghc-fe program then starts a distributor if one is not already running, sends it a source file, and waits for the response, acting like a drop-in replacement for the ghc command line. Build systems just call ghc-fe, with the extra responsibility of calling ghc-fe --quit when they are done. And if they know how many files they want to rebuild, it won't be worth it below a certain threshold.

So I'm wondering: does this seem reasonable and feasible? Is there a better way to do it? Even if it could be done, would it be worth it? If the answers are "yes", "maybe not", and "maybe yes", then how hard would it be and where should I start looking? I'm assuming start at GhcMake.hs and work outwards from there... I'm not entirely sure it would be worth it to me even if it did make full builds, say, 1.5x faster on my dual-core i5, but it's interesting to think about all the same.
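For what it's worth, the ghc-fe side could be quite small. Here is a sketch using the network package's Network.Socket; the socket path and the one-line wire format (a "compile ..." request out, "ok" or "fail" back) are assumptions for illustration, not a defined protocol:

```haskell
-- Sketch of ghc-fe: connect to the distributor, send the compile
-- request, block on the status line, and exit with a matching code so
-- build systems can call it exactly like ghc.
import Network.Socket
import System.Environment (getArgs)
import System.Exit (exitSuccess, exitFailure)
import System.IO

-- Invented wire format: one request line out, one status line back.
request :: [String] -> String
request args = unwords ("compile" : args)

main :: IO ()
main = do
    args <- getArgs
    sock <- socket AF_UNIX Stream defaultProtocol
    connect sock (SockAddrUnix "/tmp/ghc-distributor.sock")  -- assumed path
    h <- socketToHandle sock ReadWriteMode
    hSetBuffering h LineBuffering
    hPutStrLn h (request args)
    result <- hGetLine h  -- blocks until the distributor reports back
    hClose h
    if result == "ok" then exitSuccess else exitFailure
```

Exiting with the compile's status code is what makes it a drop-in replacement: the build system only looks at the exit code, the same as with plain ghc.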