proptotype of make style dep stuff

26 Oct 2007

      All,

At the recent Haskell Hackathon after ICFP, one of the issues that
filled a lot of the thinking and drinking time of the Cabal hackers
present was the issue of dependency analysis.

I'd like to present some initial ideas and code that arose out of
discussions, particularly between Lennart Kolmodin, Thomas Schilling and
myself. The code I'm presenting was written jointly with them. This
message is going to be one of those crazy literate Haskell programs. The
code is also available here:

darcs get http://haskell.org/~duncan/cabal/dep-experiment/

One thing we debated for some time was whether we needed support for
dynamic or just static dependencies. The distinction here is whether one
can always construct the entire dependency graph before running any
expensive actions like pre-processors or compilers. If we assume that
users do not need to specify every module in their package then it's
easy to construct examples where it is impossible to construct the
entire dependency graph without running pre-processors. So one of the
requirements of this design is that we can interleave discovery of the
dependency graph with running actions.

We would also like to be able to specify and test the implementation. So
allowing testing without having to do IO would be an advantage.

We also want it to be fairly generic and extensible with respect to the
kinds of actions and rules we can express. We want people to be able to
extend the system with their own custom pre-processors for example.

So let's dive in...
...
data Graph m = Graph [Rule m] [Target]
We'll use an explicit representation of a dependency graph. This
representation is not efficient but it's fairly simple. We have a set of
rules and a set of targets which we're currently chasing but have not
yet got rules to cover. So we would start our make procedure with a
graph with no rules and just the targets we wish to build.
...
data Rule m = Rule {
                targets :: [Target],
                depends :: [Target],
                action  :: m [Target]
              }
A rule has a list of targets and a list of dependencies and an action.
So a rule is not a simple edge or node in a dependency graph. It is a
bundle of edges with an associated action. The reason we need multiple
targets is because many build actions produce multiple output files,
think .o and .hi files as the primary example. So a Rule says that there
is a dependency or edge between every target and every dependency.

This representation is mostly lifted from Neil Mitchel's make code. The
difference is in the handling of dynamic dependencies. Here, the action
can return a list of targets and this is used in our implementation of
dynamic dependencies. I will come back to how that works later.

Note also that the Rule and thus the Graph type are parameterised by a
type m. This will be the monad that the actions work in. It will allow
us to use it with a testing monad that does no IO and enables QuickCheck
specifications and also later with a monad that can actually to IO and
run programs.
...
instance Show (Rule m) where
  show (Rule ts ds _) = "Rule " ++ show ts ++ " " ++ show ds ++ " (<action>)"
We need to be able to show rules, though we cannot show the associated
action of course.
...
type Target = FilePath
We've talked about Targets already without saying what they are.
Initially we are using the simplification that a target is just a
FilePath. We currently do not distinguish different kinds of targets, so
called resolved or unresovled targets. We will need a notion of a thing
that we depend on, but which we have not yet nailed down to a particular
concrete object like a file. It will be necessary later to make have
that notion to support search paths and other indirections.

We will also later want to have more than a single name space for
targets, since not every target corresponds directly to a file.
...
type DepGenerator m = Target -> m [Rule m]
A dependency generator is a function that given a target produces for us
one or more rules that 'cover' that target. That is there is one rule
that has that target in its targets list. It may also fail if there is
no way to find a rule. Again the action is in some monad, so it may go
and do actions in the monad like searching the file system. It
presumably will also consult some rule schema that say how to make a
rule for any file of a particular kind.

This is one way we can separate the dependency chasing from the main
make algorithm.
...
make :: Monad m
     => DepGenerator m
     -> Graph m
     -> m ()
So we come to the make algorithm. As I mentioned before it is
parametrised by some Monad m.

It takes one of these dependency generation functions and the current
Graph. Initially we expect the Graph will have no rules and just the
initial targets (which might for example be the .hi/.o files
corresponding to the exposed modules in a library).
...
make gen (Graph [] []) = return ()
If the graph is completely empty we're done. This will always happen in
the end because we remove rules when we've run their actions.
...
make gen (Graph rules (t:ts)) = do
  rules' <- gen t
  make gen (Graph (rules' ++ rules) ts)
We prefer expanding the graph to running actions. This is because
expanding the graph gives us more choice about which actions to run
later (possibly in parallel). It also means that when it is possible to
generate the whole graph as in the static depgraph approach, that we do
just that.

So we pick the first unresolved dependency and we ask the dep generator
to make some rules that tell us how to build that as a target. We add
those rules to our rule set and carry on.

There are some requirements on the returned rules here. For one thing
they must actually 'cover' the target we were asking about. Secondly
they must extend our current graph in a connected way without
overlapping any targets. These are all (ill-specified) invariants of the
dep graph, that it must be connected and it must not be ambiguous which
rule to use to generate a particular target. No doubt there are other
invariants that we should specify and check.
...
make gen (Graph rules []) =
 case selectReadyRule rules of
   (rule, rules') -> do
      dyntargets <- action rule
      make gen (Graph (markCompletedDeps (targets rule) dyntargets rules')
                      (newDeps dyntargets rules'))
Ok, now for the interesting and slightly more tricky bit. If we are done
for the moment expanding the graph then we turn to reducing the graph by
running actions.
...
selectReadyRule :: [Rule m] -> (Rule m, [Rule m])
selectReadyRule rules =
  case partition (null . depends) rules of
    (rule:rules',rules'') -> (rule, rules' ++ rules'')
    ([], _) -> error "selectReadyRule: no rules ready to go!?!"
We first of all select a rule that is read to be run. A rule is ready if
it has no remaining dependencies. This should always be the case if the
graph is not empty and has no dangling dependencies and is not cyclic.
This property is something our graph invariants should tell us.

Note that this is where we have the opportunity for parallelism as we
can pick all the rules with no dependencies. For the moment we will just
pick one rule.

We run the rule's action and get back a list of targets. I mentioned
earlier that these targets are to do with the way we've chosen to
implement dynamic dependencies. Lets look at that in more detail now.

It is perhaps best explained with an example, in pseudo-make syntax:

foo.o foo.hi : foo.hs $(include foo.dep)
	ghc -c foo.hs
foo.dep : foo.hs
	gatherdeps foo.hs > foo.dep

What we mean by this is that foo.dep will contain the dependencies of
foo.hs that we discover by reading foo.hs and looking at it's imports.
They are dynamic in the sense that we can only discover the dependencies
once foo.hs exists, since it may be generated by a pre-processor.

Then we say that foo.o foo.hi depend on all the dependencies given in
foo.dep, which of course requires foo.dep to be up to date. So what
happens while we're reducing this graph is that we replace the
dependency foo.dep with it's contents. We do that when we run the action
that brings foo.dep up to date.

So let us express the above rules in our notation:

r1 = Rule {
  targets = ["foo.o", "foo.hi"],
  depends = ["foo.hs", "foo.dep"],
  action  = ghc "foo.hs"
}
r2 = Rule {
  targets = ["foo.dep"],
  depends = ["foo.hs"],
  action  = return ["bar.hi"]
}

Obviously the dep action would not be a constant like that but would
read the .hs file to find the imports, and cache them in the .dep file.
But it would return them all in a list like that.

Recalling where we were in our make algorithm...
...
make gen (Graph rules []) =
 case selectReadyRule rules of
   (rule, rules') -> do
      dyntargets <- action rule
      make gen (Graph (markCompletedDeps (targets rule) dyntargets rules')
                      (newDeps dyntargets rules'))
So we have run the rules action which means we can assume the rule's
targets are up to date. So we can go and cross off all those targets
where they appear as dependencies in other rules. In addition if this
action returned any dependencies then we have to go insert those in
replacement of the completed target.
...
markCompletedDeps :: [Target] -> [Target] -> [Rule m] -> [Rule m]
markCompletedDeps targets dyntargets = map updateDepends
  where updateDepends rule@Rule { depends = ds }
          | length ds == length ds' = rule
          | otherwise               = rule { depends = dyntargets ++ ds' }
          where ds' = ds \\ targets
So we do that just by going through all the rules and for each one
checking if any of the completed targets occurred in that rule's depends
list. If it did we also insert any dynamic targets that the action may
have produced.
...
newDeps :: [Target] -> [Rule m] -> [Target]
newDeps ts rules = [ t | t <- ts, all (\rule -> t `notElem` targets rule) rules ]
These dynamic targets may also in fact be completely new targets which
were not already in the graph. If this is the case then we have to find
out which ones are new and add the to the list of unresolved targets in
the Graph before calling make again. In that case we would go back to
expanding the graph.

So that's it. That's the make algorithm. Now lets try it out...

We're going to implement a monad to use make on. It's going to be a
simulation of a file system rather than a monad that does real IO and
uses the real file system. This will make testing easier.
...
data State = State {
               currentTime :: Timestamp,
               filesystem  :: FileSystem
             }
The monad will be a state monad with the state consisting of the current
state of the filesystem and the current time.
...
type FileSystem = Map.Map FilePath File
type Timestamp = Int
A filesystem is just a map of paths to files and a timestamp is just an
integer. The time will increase monotonically with each interesting
action on the filesystem.
...
data File = File {
              timestamp :: Timestamp,
              content   :: [String]
            }
  deriving Show
A file has a timestamp of when it was created or last modified and it
has content which we model as just a list of strings. For different
kinds of files we will interpret theses strings differently.
...
emptyState = State 1 Map.empty
The initial state is an empty filesystem at time 1.
...
instance Show State where
    show (State _ tr) =
        '\n' : (unlines $ map show $ Map.keys tr)
We can show states, just listing the paths in the filesystem.
...
type Trace = [(Action, State)]
It will be useful in specifying the behaviour of make to have
specifications that can quantify over all actions in the execution. So
we have a Trace which is a history of all actions and the state of the
system at the time of that action.
...
newtype Make a = Make { unMake :: WriterT Trace (Monad.State State) a }
    deriving (Functor, Monad, MonadState State, MonadWriter Trace)
Our Make monad is then a state monad with the current state and it's
also a Writer monad producing the Trace.
...
runMake :: Make () -> Trace
runMake = 
  snd . fst . flip runState emptyState . runWriterT . unMake
Running an action in this monad gives us the action's Trace.
...
data Action = Stat FilePath
            | ReadFile  FilePath
            | WriteFile FilePath [String]
  deriving Show
The actions we can get in a trace are all primitives actions on the
filesystem. We distinguish stats from file reads though it is not yet
clear if this is necessary.
...
log :: Action -> Make ()
log action = do
  state <- get
  tell [(action, state)]
We can insert actions into the log.
...
stat :: FilePath -> Make (Maybe Timestamp)
stat path = do
  log (Stat path)
  State _ filesystem <- get
  case Map.lookup path filesystem of
    Just file -> return (Just (timestamp file))
    Nothing   -> return Nothing
Stating a file tests for its existance. If it does exist we get back the
file's timestamp but not it's content.
...
readFile :: FilePath -> Make File
readFile path = do 
  log (ReadFile path)
  State _ filesystem <- get 
  case Map.lookup path filesystem of
    Just file -> return file
    Nothing   -> fail $ "file does not exist: " ++ path
Reading a file gets the File, that is the timestamp and the content.
...
writeFile :: FilePath -> [String] -> Make ()
writeFile path content = do 
  log (WriteFile path content)
  State curtime filesystem <- get
  let file = File curtime content
  put $ State (curtime + 1) (Map.insert path file filesystem)
To write a file you supply the content and the timestamp gets set as the
current time. The world timestamp gets incremented at this point.
...
exists :: FilePath -> Make Bool
exists path = isJust <$> stat path
...
touch :: FilePath -> Make ()
touch path = writeFile path []
A couple convenience functions in terms of the primitives.

So now we can write some programs that run in our Make monad and
interact with our filesystem.
...
ghc :: FilePath -> Make ()
ghc file = 
  case splitExtension file of
    (baseFile,".hs") -> do
      File _ imports <- readFile file
      mapM_ readFile (map (<.> "hi") imports)
      mapM_ touch $ map (baseFile <.>) ["o", "hi"] 
    _ -> fail $ "ghc didn't get a .hs file: " ++ file
You never realised ghc was so simple eh? So it reads the .hs file. We
interpret the content of the .hs file to be a list of module imports. So
we map those to paths of .hi files and go and read the .hi files of the
imported modules. Finally we write the output .o and .hi files.

So we'll want to try it out with our make code. Let's construct an
example with a couple modules, a main module foo.hs which imports module
Bar from Bar.hs.
...
test = do
  writeFile "foo.hs" ["Bar"]
  touch "Bar.hs"
  make gen (Graph [] ["foo.hi"])
So there we have it, foo.hs imports Bar and Bar.hs imports nothing. We
then want to make foo.hi which will involve building both modules.

Of course we need our dependency generator function gen and some rule
schema that say how to build .hi files.
...
gen :: DepGenerator Make
gen target
  | ext == ".hi" = return [hiFileRuleSchema file
                          ,depFileRuleSchema file
                          ,fileExistsRuleSchema (file <.> "hs")]
where (file, ext) = splitExtension target
For this example we can assume we only have to find rules to cover .hi
files. We'll generate three rules, one to compile the .hi and .o file
from the .hs file. Though it also depends on the .dep file. So we need
another to generate the .dep from the .hs. Finally we need a rule that
generates the .hs from nothing.
...
fileExistsRuleSchema :: FilePath -> Rule Make
fileExistsRuleSchema file = Rule {
    targets = [file],
    depends = [],
    action  = readFile file >> return []
  }
Our system does not allow dangling dependencies, so even the simple case
of a source file needs a rule. Of course it is a trivial rule as it has
no depends. The action asserts that the file exists and does not
generate any dynamic dependencies.

Note that an alternative system to caching deps in using dep files would
be for the .hs rule to return the dynamic deps directly.
...
hiFileRuleSchema :: FilePath -> Rule Make
hiFileRuleSchema file = Rule {
    targets = [file <.> "hi", file <.> "o"],
    depends = [file <.> "hs", file <.> "dep"],
    action  = ghc (file <.> "hs") >> return []
  }
So the .hi .o rule runs ghc on the .hs file.
...
depFileRuleSchema :: FilePath -> Rule Make
depFileRuleSchema file = Rule {
    targets = [file <.> "dep"],
    depends = [file <.> "hs"],
    action  = do
      yep <- exists (file <.> "dep")
      if yep
        then do File _ deps <- readFile (file <.> "dep")
                return deps
        else do File _ imports <- readFile (file <.> "hs")
                let deps = map (<.> "hi") imports
                writeFile (file <.> "dep") deps
                return deps
  }
The .dep rule makes the .dep file from the .hs file. The action is a bit
more complicated. Actually I now notice that it's wrong :-). It's
caching the dependencies of the .hs file in the .dep file, but it is not
re-reading the dependencies if the cached ones are stale. However, in
either case it returns the list of dependencies. These get substituted
for the *.dep file in the hi rule.

So that's the whole system. Here's the result of evaluating
...
map fst (runMake test)
WriteFile "foo.hs" ["Bar"]
WriteFile "Bar.hs" []
ReadFile "foo.hs"
Stat "foo.dep"
ReadFile "foo.hs"
WriteFile "foo.dep" ["Bar.hi"]
ReadFile "Bar.hs"
Stat "Bar.dep"
ReadFile "Bar.hs"
WriteFile "Bar.dep" []
ReadFile "Bar.hs"
WriteFile "Bar.o" []
WriteFile "Bar.hi" []
ReadFile "foo.hs"
ReadFile "Bar.hi"
WriteFile "foo.o" []
WriteFile "foo.hi" []

The trace shows us discovering the dependencies of foo.hs and going and
compiling Bar.hs. We could also inspect the state of the filesystem at
the end, or indeed at any intermediate stage.

So what's next...

We want to write some specifications, both of what make should do and
some internal invariants about the graph. For example what specification
would detect the bug above about .dep files not being re-generated when
they are stale? We should also be able to make performance
specifications like saying that (any >> make >> make) extends the trace
of (any >> make) with just file stats/reads and no writes. This should
correspond to saying that make should be able to find out quickly that
there is nothing to be done in a built tree just by checking timestamps
and cached deps.

We've not actually implemented checking if actions need to run or if
they are already up to date with respect to their dependencies. This can
be integrated into a smart Rule constructor that wraps the action with
checks on the timestamps of the target and depends. It probably does not
need to be integrated into the core make algorithm, it can just blindly
run actions (though if we want to allow continuous builds we will have
to revisit this). Again we ought to be able to construct specifications
that check that we do not needlessly run actions but only do them when
targets are out of date with respect to their dependencies.

We should construct well known tricky examples, like search path
shadowing problems and make sure our specifications catch such bugs.

We currently have not looked at search paths. This makes things harder
and probably requires a distinction between resolved and unresolved
Targets.

We currently do not cope with extra targets that we discover
dynamically, like _stub.c files that pop out of compilation. We will
have to track these as we at least have to link _stub.o files and in the
case of c2hs producing .c files we may have to do further actions to
compile them too. Then there's .hi boot files and what to do when we
detect cycles. Lots of fun.

So we'd appreciate comments, review and especially help with specifying,
building and verifying a more realistic model before we move to an
implementation using IO and invoking real programs.

darcs get http://haskell.org/~duncan/cabal/dep-experiment/

patches gladly accepted.

Of course there are existing systems which solve similar problems and
people have had goes already at implementing dep analysis directly in
Cabal. We don't want to ignore existing stuff but it's important we have
something we can all understand and specify and verify. And extend. It's
a tall order.

Duncan

proptotype of make style dep stuff

Duncan Coutts