
All, At the recent Haskell Hackathon after ICFP, one of the issues that filled a lot of the thinking and drinking time of the Cabal hackers present was the issue of dependency analysis. I'd like to present some initial ideas and code that arose out of discussions, particularly between Lennart Kolmodin, Thomas Schilling and myself. The code I'm presenting was written jointly with them. This message is going to be one of those crazy literate Haskell programs. The code is also available here: darcs get http://haskell.org/~duncan/cabal/dep-experiment/ One thing we debated for some time was whether we needed support for dynamic or just static dependencies. The distinction here is whether one can always construct the entire dependency graph before running any expensive actions like pre-processors or compilers. If we assume that users do not need to specify every module in their package then it's easy to construct examples where it is impossible to construct the entire dependency graph without running pre-processors. So one of the requirements of this design is that we can interleave discovery of the dependency graph with running actions. We would also like to be able to specify and test the implementation. So allowing testing without having to do IO would be an advantage. We also want it to be fairly generic and extensible with respect to the kinds of actions and rules we can express. We want people to be able to extend the system with their own custom pre-processors for example. So let's dive in...
data Graph m = Graph [Rule m] [Target]
We'll use an explicit representation of a dependency graph. This representation is not efficient but it's fairly simple. We have a set of rules and a set of targets which we're currently chasing but have not yet got rules to cover. So we would start our make procedure with a graph with no rules and just the targets we wish to build.
data Rule m = Rule { targets :: [Target], depends :: [Target], action :: m [Target] }
A rule has a list of targets and a list of dependencies and an action. So a rule is not a simple edge or node in a dependency graph. It is a bundle of edges with an associated action. The reason we need multiple targets is because many build actions produce multiple output files, think .o and .hi files as the primary example. So a Rule says that there is a dependency or edge between every target and every dependency. This representation is mostly lifted from Neil Mitchel's make code. The difference is in the handling of dynamic dependencies. Here, the action can return a list of targets and this is used in our implementation of dynamic dependencies. I will come back to how that works later. Note also that the Rule and thus the Graph type are parameterised by a type m. This will be the monad that the actions work in. It will allow us to use it with a testing monad that does no IO and enables QuickCheck specifications and also later with a monad that can actually to IO and run programs.
instance Show (Rule m) where show (Rule ts ds _) = "Rule " ++ show ts ++ " " ++ show ds ++ " (<action>)"
We need to be able to show rules, though we cannot show the associated action of course.
type Target = FilePath
We've talked about Targets already without saying what they are. Initially we are using the simplification that a target is just a FilePath. We currently do not distinguish different kinds of targets, so called resolved or unresovled targets. We will need a notion of a thing that we depend on, but which we have not yet nailed down to a particular concrete object like a file. It will be necessary later to make have that notion to support search paths and other indirections. We will also later want to have more than a single name space for targets, since not every target corresponds directly to a file.
type DepGenerator m = Target -> m [Rule m]
A dependency generator is a function that given a target produces for us one or more rules that 'cover' that target. That is there is one rule that has that target in its targets list. It may also fail if there is no way to find a rule. Again the action is in some monad, so it may go and do actions in the monad like searching the file system. It presumably will also consult some rule schema that say how to make a rule for any file of a particular kind. This is one way we can separate the dependency chasing from the main make algorithm.
make :: Monad m => DepGenerator m -> Graph m -> m ()
So we come to the make algorithm. As I mentioned before it is parametrised by some Monad m. It takes one of these dependency generation functions and the current Graph. Initially we expect the Graph will have no rules and just the initial targets (which might for example be the .hi/.o files corresponding to the exposed modules in a library).
make gen (Graph [] []) = return ()
If the graph is completely empty we're done. This will always happen in the end because we remove rules when we've run their actions.
make gen (Graph rules (t:ts)) = do rules' <- gen t make gen (Graph (rules' ++ rules) ts)
We prefer expanding the graph to running actions. This is because expanding the graph gives us more choice about which actions to run later (possibly in parallel). It also means that when it is possible to generate the whole graph as in the static depgraph approach, that we do just that. So we pick the first unresolved dependency and we ask the dep generator to make some rules that tell us how to build that as a target. We add those rules to our rule set and carry on. There are some requirements on the returned rules here. For one thing they must actually 'cover' the target we were asking about. Secondly they must extend our current graph in a connected way without overlapping any targets. These are all (ill-specified) invariants of the dep graph, that it must be connected and it must not be ambiguous which rule to use to generate a particular target. No doubt there are other invariants that we should specify and check.
make gen (Graph rules []) = case selectReadyRule rules of (rule, rules') -> do dyntargets <- action rule make gen (Graph (markCompletedDeps (targets rule) dyntargets rules') (newDeps dyntargets rules'))
Ok, now for the interesting and slightly more tricky bit. If we are done for the moment expanding the graph then we turn to reducing the graph by running actions.
selectReadyRule :: [Rule m] -> (Rule m, [Rule m]) selectReadyRule rules = case partition (null . depends) rules of (rule:rules',rules'') -> (rule, rules' ++ rules'') ([], _) -> error "selectReadyRule: no rules ready to go!?!"
We first of all select a rule that is read to be run. A rule is ready if it has no remaining dependencies. This should always be the case if the graph is not empty and has no dangling dependencies and is not cyclic. This property is something our graph invariants should tell us. Note that this is where we have the opportunity for parallelism as we can pick all the rules with no dependencies. For the moment we will just pick one rule. We run the rule's action and get back a list of targets. I mentioned earlier that these targets are to do with the way we've chosen to implement dynamic dependencies. Lets look at that in more detail now. It is perhaps best explained with an example, in pseudo-make syntax: foo.o foo.hi : foo.hs $(include foo.dep) ghc -c foo.hs foo.dep : foo.hs gatherdeps foo.hs > foo.dep What we mean by this is that foo.dep will contain the dependencies of foo.hs that we discover by reading foo.hs and looking at it's imports. They are dynamic in the sense that we can only discover the dependencies once foo.hs exists, since it may be generated by a pre-processor. Then we say that foo.o foo.hi depend on all the dependencies given in foo.dep, which of course requires foo.dep to be up to date. So what happens while we're reducing this graph is that we replace the dependency foo.dep with it's contents. We do that when we run the action that brings foo.dep up to date. So let us express the above rules in our notation: r1 = Rule { targets = ["foo.o", "foo.hi"], depends = ["foo.hs", "foo.dep"], action = ghc "foo.hs" } r2 = Rule { targets = ["foo.dep"], depends = ["foo.hs"], action = return ["bar.hi"] } Obviously the dep action would not be a constant like that but would read the .hs file to find the imports, and cache them in the .dep file. But it would return them all in a list like that. Recalling where we were in our make algorithm...
make gen (Graph rules []) = case selectReadyRule rules of (rule, rules') -> do dyntargets <- action rule make gen (Graph (markCompletedDeps (targets rule) dyntargets rules') (newDeps dyntargets rules'))
So we have run the rules action which means we can assume the rule's targets are up to date. So we can go and cross off all those targets where they appear as dependencies in other rules. In addition if this action returned any dependencies then we have to go insert those in replacement of the completed target.
markCompletedDeps :: [Target] -> [Target] -> [Rule m] -> [Rule m] markCompletedDeps targets dyntargets = map updateDepends where updateDepends rule@Rule { depends = ds } | length ds == length ds' = rule | otherwise = rule { depends = dyntargets ++ ds' } where ds' = ds \\ targets
So we do that just by going through all the rules and for each one checking if any of the completed targets occurred in that rule's depends list. If it did we also insert any dynamic targets that the action may have produced.
newDeps :: [Target] -> [Rule m] -> [Target] newDeps ts rules = [ t | t <- ts, all (\rule -> t `notElem` targets rule) rules ]
These dynamic targets may also in fact be completely new targets which were not already in the graph. If this is the case then we have to find out which ones are new and add the to the list of unresolved targets in the Graph before calling make again. In that case we would go back to expanding the graph. So that's it. That's the make algorithm. Now lets try it out... We're going to implement a monad to use make on. It's going to be a simulation of a file system rather than a monad that does real IO and uses the real file system. This will make testing easier.
data State = State { currentTime :: Timestamp, filesystem :: FileSystem }
The monad will be a state monad with the state consisting of the current state of the filesystem and the current time.
type FileSystem = Map.Map FilePath File type Timestamp = Int
A filesystem is just a map of paths to files and a timestamp is just an integer. The time will increase monotonically with each interesting action on the filesystem.
data File = File { timestamp :: Timestamp, content :: [String] } deriving Show
A file has a timestamp of when it was created or last modified and it has content which we model as just a list of strings. For different kinds of files we will interpret theses strings differently.
emptyState = State 1 Map.empty
The initial state is an empty filesystem at time 1.
instance Show State where show (State _ tr) = '\n' : (unlines $ map show $ Map.keys tr)
We can show states, just listing the paths in the filesystem.
type Trace = [(Action, State)]
It will be useful in specifying the behaviour of make to have specifications that can quantify over all actions in the execution. So we have a Trace which is a history of all actions and the state of the system at the time of that action.
newtype Make a = Make { unMake :: WriterT Trace (Monad.State State) a } deriving (Functor, Monad, MonadState State, MonadWriter Trace)
Our Make monad is then a state monad with the current state and it's also a Writer monad producing the Trace.
runMake :: Make () -> Trace runMake = snd . fst . flip runState emptyState . runWriterT . unMake
Running an action in this monad gives us the action's Trace.
data Action = Stat FilePath | ReadFile FilePath | WriteFile FilePath [String] deriving Show
The actions we can get in a trace are all primitives actions on the filesystem. We distinguish stats from file reads though it is not yet clear if this is necessary.
log :: Action -> Make () log action = do state <- get tell [(action, state)]
We can insert actions into the log.
stat :: FilePath -> Make (Maybe Timestamp) stat path = do log (Stat path) State _ filesystem <- get case Map.lookup path filesystem of Just file -> return (Just (timestamp file)) Nothing -> return Nothing
Stating a file tests for its existance. If it does exist we get back the file's timestamp but not it's content.
readFile :: FilePath -> Make File readFile path = do log (ReadFile path) State _ filesystem <- get case Map.lookup path filesystem of Just file -> return file Nothing -> fail $ "file does not exist: " ++ path
Reading a file gets the File, that is the timestamp and the content.
writeFile :: FilePath -> [String] -> Make () writeFile path content = do log (WriteFile path content) State curtime filesystem <- get let file = File curtime content put $ State (curtime + 1) (Map.insert path file filesystem)
To write a file you supply the content and the timestamp gets set as the current time. The world timestamp gets incremented at this point.
exists :: FilePath -> Make Bool exists path = isJust <$> stat path
touch :: FilePath -> Make () touch path = writeFile path []
A couple convenience functions in terms of the primitives. So now we can write some programs that run in our Make monad and interact with our filesystem.
ghc :: FilePath -> Make () ghc file = case splitExtension file of (baseFile,".hs") -> do File _ imports <- readFile file mapM_ readFile (map (<.> "hi") imports) mapM_ touch $ map (baseFile <.>) ["o", "hi"] _ -> fail $ "ghc didn't get a .hs file: " ++ file
You never realised ghc was so simple eh? So it reads the .hs file. We interpret the content of the .hs file to be a list of module imports. So we map those to paths of .hi files and go and read the .hi files of the imported modules. Finally we write the output .o and .hi files. So we'll want to try it out with our make code. Let's construct an example with a couple modules, a main module foo.hs which imports module Bar from Bar.hs.
test = do writeFile "foo.hs" ["Bar"] touch "Bar.hs" make gen (Graph [] ["foo.hi"])
So there we have it, foo.hs imports Bar and Bar.hs imports nothing. We then want to make foo.hi which will involve building both modules. Of course we need our dependency generator function gen and some rule schema that say how to build .hi files.
gen :: DepGenerator Make gen target | ext == ".hi" = return [hiFileRuleSchema file ,depFileRuleSchema file ,fileExistsRuleSchema (file <.> "hs")]
where (file, ext) = splitExtension target
For this example we can assume we only have to find rules to cover .hi files. We'll generate three rules, one to compile the .hi and .o file from the .hs file. Though it also depends on the .dep file. So we need another to generate the .dep from the .hs. Finally we need a rule that generates the .hs from nothing.
fileExistsRuleSchema :: FilePath -> Rule Make fileExistsRuleSchema file = Rule { targets = [file], depends = [], action = readFile file >> return [] }
Our system does not allow dangling dependencies, so even the simple case of a source file needs a rule. Of course it is a trivial rule as it has no depends. The action asserts that the file exists and does not generate any dynamic dependencies. Note that an alternative system to caching deps in using dep files would be for the .hs rule to return the dynamic deps directly.
hiFileRuleSchema :: FilePath -> Rule Make hiFileRuleSchema file = Rule { targets = [file <.> "hi", file <.> "o"], depends = [file <.> "hs", file <.> "dep"], action = ghc (file <.> "hs") >> return [] }
So the .hi .o rule runs ghc on the .hs file.
depFileRuleSchema :: FilePath -> Rule Make depFileRuleSchema file = Rule { targets = [file <.> "dep"], depends = [file <.> "hs"], action = do yep <- exists (file <.> "dep") if yep then do File _ deps <- readFile (file <.> "dep") return deps else do File _ imports <- readFile (file <.> "hs") let deps = map (<.> "hi") imports writeFile (file <.> "dep") deps return deps }
The .dep rule makes the .dep file from the .hs file. The action is a bit more complicated. Actually I now notice that it's wrong :-). It's caching the dependencies of the .hs file in the .dep file, but it is not re-reading the dependencies if the cached ones are stale. However, in either case it returns the list of dependencies. These get substituted for the *.dep file in the hi rule. So that's the whole system. Here's the result of evaluating
map fst (runMake test)
WriteFile "foo.hs" ["Bar"] WriteFile "Bar.hs" [] ReadFile "foo.hs" Stat "foo.dep" ReadFile "foo.hs" WriteFile "foo.dep" ["Bar.hi"] ReadFile "Bar.hs" Stat "Bar.dep" ReadFile "Bar.hs" WriteFile "Bar.dep" [] ReadFile "Bar.hs" WriteFile "Bar.o" [] WriteFile "Bar.hi" [] ReadFile "foo.hs" ReadFile "Bar.hi" WriteFile "foo.o" [] WriteFile "foo.hi" [] The trace shows us discovering the dependencies of foo.hs and going and compiling Bar.hs. We could also inspect the state of the filesystem at the end, or indeed at any intermediate stage. So what's next... We want to write some specifications, both of what make should do and some internal invariants about the graph. For example what specification would detect the bug above about .dep files not being re-generated when they are stale? We should also be able to make performance specifications like saying that (any >> make >> make) extends the trace of (any >> make) with just file stats/reads and no writes. This should correspond to saying that make should be able to find out quickly that there is nothing to be done in a built tree just by checking timestamps and cached deps. We've not actually implemented checking if actions need to run or if they are already up to date with respect to their dependencies. This can be integrated into a smart Rule constructor that wraps the action with checks on the timestamps of the target and depends. It probably does not need to be integrated into the core make algorithm, it can just blindly run actions (though if we want to allow continuous builds we will have to revisit this). Again we ought to be able to construct specifications that check that we do not needlessly run actions but only do them when targets are out of date with respect to their dependencies. We should construct well known tricky examples, like search path shadowing problems and make sure our specifications catch such bugs. We currently have not looked at search paths. This makes things harder and probably requires a distinction between resolved and unresolved Targets. We currently do not cope with extra targets that we discover dynamically, like _stub.c files that pop out of compilation. We will have to track these as we at least have to link _stub.o files and in the case of c2hs producing .c files we may have to do further actions to compile them too. Then there's .hi boot files and what to do when we detect cycles. Lots of fun. So we'd appreciate comments, review and especially help with specifying, building and verifying a more realistic model before we move to an implementation using IO and invoking real programs. darcs get http://haskell.org/~duncan/cabal/dep-experiment/ patches gladly accepted. Of course there are existing systems which solve similar problems and people have had goes already at implementing dep analysis directly in Cabal. We don't want to ignore existing stuff but it's important we have something we can all understand and specify and verify. And extend. It's a tall order. Duncan