prototype of make style dep stuff

All,

At the recent Haskell Hackathon after ICFP, one of the issues that filled a lot of the thinking and drinking time of the Cabal hackers present was the issue of dependency analysis. I'd like to present some initial ideas and code that arose out of discussions, particularly between Lennart Kolmodin, Thomas Schilling and myself. The code I'm presenting was written jointly with them.

This message is going to be one of those crazy literate Haskell programs. The code is also available here:

darcs get http://haskell.org/~duncan/cabal/dep-experiment/

One thing we debated for some time was whether we needed support for dynamic or just static dependencies. The distinction here is whether one can always construct the entire dependency graph before running any expensive actions like pre-processors or compilers. If we assume that users do not need to specify every module in their package then it's easy to construct examples where it is impossible to construct the entire dependency graph without running pre-processors. So one of the requirements of this design is that we can interleave discovery of the dependency graph with running actions.

We would also like to be able to specify and test the implementation. So allowing testing without having to do IO would be an advantage. We also want it to be fairly generic and extensible with respect to the kinds of actions and rules we can express. We want people to be able to extend the system with their own custom pre-processors for example.

So let's dive in...
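(For reference, the code in this message assumes roughly the following module header; this is a sketch of the imports it needs, and the exact header in the darcs repo may differ:)

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Prelude hiding (readFile, writeFile, log)
import Control.Applicative ((<$>))
import Control.Monad.State (MonadState, get, put, runState)
import qualified Control.Monad.State as Monad
import Control.Monad.Writer (WriterT, MonadWriter, tell, runWriterT)
import Data.List (partition, nub, (\\))
import Data.Maybe (isJust)
import qualified Data.Map as Map
import System.FilePath (splitExtension, (<.>))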
data Graph m = Graph [Rule m] [Target]
We'll use an explicit representation of a dependency graph. This representation is not efficient but it's fairly simple. We have a set of rules and a set of targets which we're currently chasing but have not yet got rules to cover. So we would start our make procedure with a graph with no rules and just the targets we wish to build.
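For example (an illustration, not code from the repo), building a library that exposes Data.Foo and Data.Bar could start from a graph with no rules and just the corresponding interface files as targets:

startGraph :: Graph m
startGraph = Graph [] ["Data/Foo.hi", "Data/Bar.hi"]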
data Rule m = Rule {
    targets :: [Target],
    depends :: [Target],
    action  :: m [Target]
  }
A rule has a list of targets and a list of dependencies and an action. So a rule is not a simple edge or node in a dependency graph. It is a bundle of edges with an associated action. The reason we need multiple targets is because many build actions produce multiple output files, think .o and .hi files as the primary example. So a Rule says that there is a dependency or edge between every target and every dependency. This representation is mostly lifted from Neil Mitchell's make code. The difference is in the handling of dynamic dependencies. Here, the action can return a list of targets and this is used in our implementation of dynamic dependencies. I will come back to how that works later. Note also that the Rule and thus the Graph type are parameterised by a type m. This will be the monad that the actions work in. It will allow us to use it with a testing monad that does no IO and enables QuickCheck specifications and also later with a monad that can actually do IO and run programs.
instance Show (Rule m) where
  show (Rule ts ds _) = "Rule " ++ show ts ++ " " ++ show ds ++ " (<action>)"
We need to be able to show rules, though we cannot show the associated action of course.
type Target = FilePath
We've talked about Targets already without saying what they are. Initially we are using the simplification that a target is just a FilePath. We currently do not distinguish different kinds of targets, so called resolved or unresolved targets. We will need a notion of a thing that we depend on, but which we have not yet nailed down to a particular concrete object like a file. It will be necessary later to have that notion to support search paths and other indirections. We will also later want to have more than a single name space for targets, since not every target corresponds directly to a file.
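Just to make that distinction concrete, a richer target type might eventually look something like the following sketch (a possible future refinement, not part of the current code):

type ModuleName = String

data ResolvedTarget = Resolved FilePath        -- pinned down to a concrete file
                    | Unresolved ModuleName    -- still to be found on some search path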
type DepGenerator m = Target -> m [Rule m]
A dependency generator is a function that given a target produces for us one or more rules that 'cover' that target. That is, there is one rule that has that target in its targets list. It may also fail if there is no way to find a rule. Again the generator runs in some monad, so it may go and do actions in that monad like searching the file system. It presumably will also consult some rule schemas that say how to make a rule for any file of a particular kind. This is one way we can separate the dependency chasing from the main make algorithm.
make :: Monad m => DepGenerator m -> Graph m -> m ()
So we come to the make algorithm. As I mentioned before it is parametrised by some Monad m. It takes one of these dependency generation functions and the current Graph. Initially we expect the Graph will have no rules and just the initial targets (which might for example be the .hi/.o files corresponding to the exposed modules in a library).
make gen (Graph [] []) = return ()
If the graph is completely empty we're done. This will always happen in the end because we remove rules when we've run their actions.
make gen (Graph rules (t:ts)) = do
  rules' <- gen t
  make gen (Graph (rules' ++ rules) ts)
We prefer expanding the graph to running actions. This is because expanding the graph gives us more choice about which actions to run later (possibly in parallel). It also means that when it is possible to generate the whole graph as in the static depgraph approach, that we do just that. So we pick the first unresolved dependency and we ask the dep generator to make some rules that tell us how to build that as a target. We add those rules to our rule set and carry on. There are some requirements on the returned rules here. For one thing they must actually 'cover' the target we were asking about. Secondly they must extend our current graph in a connected way without overlapping any targets. These are all (ill-specified) invariants of the dep graph, that it must be connected and it must not be ambiguous which rule to use to generate a particular target. No doubt there are other invariants that we should specify and check.
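As a first stab at pinning down two of those invariants (a sketch only; the names coversTarget and unambiguous are made up here, and a real check would want better error reporting):

-- The rules returned for a target must actually cover it.
coversTarget :: Target -> [Rule m] -> Bool
coversTarget t rules = any (\rule -> t `elem` targets rule) rules

-- No two rules in the graph may claim the same target
-- (nub comes from Data.List).
unambiguous :: [Rule m] -> Bool
unambiguous rules = length ts == length (nub ts)
  where ts = concatMap targets rules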
make gen (Graph rules []) =
  case selectReadyRule rules of
    (rule, rules') -> do
      dyntargets <- action rule
      make gen (Graph (markCompletedDeps (targets rule) dyntargets rules')
                      (newDeps dyntargets rules'))
Ok, now for the interesting and slightly more tricky bit. If we are done for the moment expanding the graph then we turn to reducing the graph by running actions.
selectReadyRule :: [Rule m] -> (Rule m, [Rule m])
selectReadyRule rules =
  case partition (null . depends) rules of
    (rule:rules', rules'') -> (rule, rules' ++ rules'')
    ([], _)                -> error "selectReadyRule: no rules ready to go!?!"
We first of all select a rule that is ready to be run. A rule is ready if it has no remaining dependencies. This should always be the case if the graph is not empty and has no dangling dependencies and is not cyclic. This property is something our graph invariants should tell us. Note that this is where we have the opportunity for parallelism as we can pick all the rules with no dependencies. For the moment we will just pick one rule.

We run the rule's action and get back a list of targets. I mentioned earlier that these targets are to do with the way we've chosen to implement dynamic dependencies. Let's look at that in more detail now. It is perhaps best explained with an example, in pseudo-make syntax:

foo.o foo.hi : foo.hs $(include foo.dep)
        ghc -c foo.hs

foo.dep : foo.hs
        gatherdeps foo.hs > foo.dep

What we mean by this is that foo.dep will contain the dependencies of foo.hs that we discover by reading foo.hs and looking at its imports. They are dynamic in the sense that we can only discover the dependencies once foo.hs exists, since it may be generated by a pre-processor. Then we say that foo.o foo.hi depend on all the dependencies given in foo.dep, which of course requires foo.dep to be up to date. So what happens while we're reducing this graph is that we replace the dependency foo.dep with its contents. We do that when we run the action that brings foo.dep up to date.

So let us express the above rules in our notation:

r1 = Rule { targets = ["foo.o", "foo.hi"]
          , depends = ["foo.hs", "foo.dep"]
          , action  = ghc "foo.hs" }

r2 = Rule { targets = ["foo.dep"]
          , depends = ["foo.hs"]
          , action  = return ["bar.hi"] }

Obviously the dep action would not be a constant like that but would read the .hs file to find the imports, and cache them in the .dep file. But it would return them all in a list like that. Recalling where we were in our make algorithm...
make gen (Graph rules []) =
  case selectReadyRule rules of
    (rule, rules') -> do
      dyntargets <- action rule
      make gen (Graph (markCompletedDeps (targets rule) dyntargets rules')
                      (newDeps dyntargets rules'))
So we have run the rule's action, which means we can assume the rule's targets are up to date. So we can go and cross off all those targets where they appear as dependencies in other rules. In addition, if the action returned any dependencies then we have to go and insert those in place of the completed target.
markCompletedDeps :: [Target] -> [Target] -> [Rule m] -> [Rule m]
markCompletedDeps targets dyntargets = map updateDepends
  where
    updateDepends rule@Rule { depends = ds }
      | length ds == length ds' = rule
      | otherwise               = rule { depends = dyntargets ++ ds' }
      where
        ds' = ds \\ targets
So we do that just by going through all the rules and for each one checking if any of the completed targets occurred in that rule's depends list. If it did we also insert any dynamic targets that the action may have produced.
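To see it on the foo example from above (a worked call using the r1 and r2 rules): after running r2's action, which returned ["bar.hi"], the "foo.dep" dependency is crossed off r1 and the dynamic dependency spliced in:

-- markCompletedDeps ["foo.dep"] ["bar.hi"] [r1]
--   == [ r1 { depends = ["bar.hi", "foo.hs"] } ]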
newDeps :: [Target] -> [Rule m] -> [Target]
newDeps ts rules = [ t | t <- ts, all (\rule -> t `notElem` targets rule) rules ]
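Continuing the same worked example, "bar.hi" is not a target of any rule we have so far, so it gets reported as a brand new target:

-- newDeps ["bar.hi"] [r1]  ==  ["bar.hi"]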
These dynamic targets may also in fact be completely new targets which were not already in the graph. If this is the case then we have to find out which ones are new and add them to the list of unresolved targets in the Graph before calling make again. In that case we would go back to expanding the graph.

So that's it. That's the make algorithm. Now let's try it out... We're going to implement a monad to use make on. It's going to be a simulation of a file system rather than a monad that does real IO and uses the real file system. This will make testing easier.
data State = State { currentTime :: Timestamp, filesystem :: FileSystem }
The monad will be a state monad with the state consisting of the current state of the filesystem and the current time.
type FileSystem = Map.Map FilePath File
type Timestamp  = Int
A filesystem is just a map of paths to files and a timestamp is just an integer. The time will increase monotonically with each interesting action on the filesystem.
data File = File { timestamp :: Timestamp, content :: [String] } deriving Show
A file has a timestamp of when it was created or last modified and it has content which we model as just a list of strings. For different kinds of files we will interpret these strings differently.
emptyState = State 1 Map.empty
The initial state is an empty filesystem at time 1.
instance Show State where
  show (State _ tr) = '\n' : (unlines $ map show $ Map.keys tr)
We can show states, just listing the paths in the filesystem.
type Trace = [(Action, State)]
It will be useful in specifying the behaviour of make to have specifications that can quantify over all actions in the execution. So we have a Trace which is a history of all actions and the state of the system at the time of that action.
newtype Make a = Make { unMake :: WriterT Trace (Monad.State State) a }
  deriving (Functor, Monad, MonadState State, MonadWriter Trace)
Our Make monad is then a state monad with the current state and it's also a Writer monad producing the Trace.
runMake :: Make () -> Trace
runMake = snd . fst . flip runState emptyState . runWriterT . unMake
Running an action in this monad gives us the action's Trace.
data Action = Stat FilePath | ReadFile FilePath | WriteFile FilePath [String] deriving Show
The actions we can get in a trace are all primitive actions on the filesystem. We distinguish stats from file reads though it is not yet clear if this is necessary.
log :: Action -> Make ()
log action = do
  state <- get
  tell [(action, state)]
We can insert actions into the log.
stat :: FilePath -> Make (Maybe Timestamp)
stat path = do
  log (Stat path)
  State _ filesystem <- get
  case Map.lookup path filesystem of
    Just file -> return (Just (timestamp file))
    Nothing   -> return Nothing
Stating a file tests for its existence. If it does exist we get back the file's timestamp but not its content.
readFile :: FilePath -> Make File
readFile path = do
  log (ReadFile path)
  State _ filesystem <- get
  case Map.lookup path filesystem of
    Just file -> return file
    Nothing   -> fail $ "file does not exist: " ++ path
Reading a file gets the File, that is the timestamp and the content.
writeFile :: FilePath -> [String] -> Make ()
writeFile path content = do
  log (WriteFile path content)
  State curtime filesystem <- get
  let file = File curtime content
  put $ State (curtime + 1) (Map.insert path file filesystem)
To write a file you supply the content and the timestamp gets set as the current time. The world timestamp gets incremented at this point.
exists :: FilePath -> Make Bool
exists path = isJust <$> stat path
touch :: FilePath -> Make ()
touch path = writeFile path []
A couple of convenience functions in terms of the primitives. So now we can write some programs that run in our Make monad and interact with our filesystem.
ghc :: FilePath -> Make ()
ghc file =
  case splitExtension file of
    (baseFile, ".hs") -> do
      File _ imports <- readFile file
      mapM_ readFile (map (<.> "hi") imports)
      mapM_ touch $ map (baseFile <.>) ["o", "hi"]
    _ -> fail $ "ghc didn't get a .hs file: " ++ file
You never realised ghc was so simple eh? So it reads the .hs file. We interpret the content of the .hs file to be a list of module imports. So we map those to paths of .hi files and go and read the .hi files of the imported modules. Finally we write the output .o and .hi files. So we'll want to try it out with our make code. Let's construct an example with a couple modules, a main module foo.hs which imports module Bar from Bar.hs.
test = do
  writeFile "foo.hs" ["Bar"]
  touch "Bar.hs"
  make gen (Graph [] ["foo.hi"])
So there we have it, foo.hs imports Bar and Bar.hs imports nothing. We then want to make foo.hi which will involve building both modules. Of course we need our dependency generator function gen and some rule schemas that say how to build .hi files.
gen :: DepGenerator Make
gen target
  | ext == ".hi" = return [ hiFileRuleSchema file
                          , depFileRuleSchema file
                          , fileExistsRuleSchema (file <.> "hs") ]
  where
    (file, ext) = splitExtension target
For this example we can assume we only have to find rules to cover .hi files. We'll generate three rules: one to compile the .hi and .o files from the .hs file (though it also depends on the .dep file), another to generate the .dep from the .hs, and finally a rule that generates the .hs from nothing.
fileExistsRuleSchema :: FilePath -> Rule Make
fileExistsRuleSchema file =
  Rule {
    targets = [file],
    depends = [],
    action  = readFile file >> return []
  }
Our system does not allow dangling dependencies, so even the simple case of a source file needs a rule. Of course it is a trivial rule as it has no depends. The action asserts that the file exists and does not generate any dynamic dependencies. Note that an alternative to caching deps in .dep files would be for the .hs rule to return the dynamic deps directly.
hiFileRuleSchema :: FilePath -> Rule Make
hiFileRuleSchema file =
  Rule {
    targets = [file <.> "hi", file <.> "o"],
    depends = [file <.> "hs", file <.> "dep"],
    action  = ghc (file <.> "hs") >> return []
  }
So the .hi .o rule runs ghc on the .hs file.
depFileRuleSchema :: FilePath -> Rule Make
depFileRuleSchema file =
  Rule {
    targets = [file <.> "dep"],
    depends = [file <.> "hs"],
    action  = do
      yep <- exists (file <.> "dep")
      if yep
        then do File _ deps <- readFile (file <.> "dep")
                return deps
        else do File _ imports <- readFile (file <.> "hs")
                let deps = map (<.> "hi") imports
                writeFile (file <.> "dep") deps
                return deps
  }
The .dep rule makes the .dep file from the .hs file. The action is a bit more complicated. Actually I now notice that it's wrong :-). It caches the dependencies of the .hs file in the .dep file, but it does not re-read the .hs file and refresh the cache when the cached deps are stale. However, in either case it returns the list of dependencies. These get substituted for the *.dep file in the hi rule.
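One possible fix (just a sketch, not what's in the repo) would be to compare the timestamps of the .dep and .hs files and regenerate the cache whenever the .hs file is at least as new:

depFileRuleSchema' :: FilePath -> Rule Make
depFileRuleSchema' file =
  Rule {
    targets = [file <.> "dep"],
    depends = [file <.> "hs"],
    action  = do
      mdep <- stat (file <.> "dep")
      mhs  <- stat (file <.> "hs")
      case (mdep, mhs) of
        -- the cached deps exist and are newer than the .hs file: reuse them
        (Just depTime, Just hsTime) | depTime > hsTime -> do
          File _ deps <- readFile (file <.> "dep")
          return deps
        -- otherwise (re)generate the cache from the .hs file's imports
        _ -> do
          File _ imports <- readFile (file <.> "hs")
          let deps = map (<.> "hi") imports
          writeFile (file <.> "dep") deps
          return deps
  }

So that's the whole system. Here's the result of evaluating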
map fst (runMake test)
WriteFile "foo.hs" ["Bar"] WriteFile "Bar.hs" [] ReadFile "foo.hs" Stat "foo.dep" ReadFile "foo.hs" WriteFile "foo.dep" ["Bar.hi"] ReadFile "Bar.hs" Stat "Bar.dep" ReadFile "Bar.hs" WriteFile "Bar.dep" [] ReadFile "Bar.hs" WriteFile "Bar.o" [] WriteFile "Bar.hi" [] ReadFile "foo.hs" ReadFile "Bar.hi" WriteFile "foo.o" [] WriteFile "foo.hi" [] The trace shows us discovering the dependencies of foo.hs and going and compiling Bar.hs. We could also inspect the state of the filesystem at the end, or indeed at any intermediate stage. So what's next... We want to write some specifications, both of what make should do and some internal invariants about the graph. For example what specification would detect the bug above about .dep files not being re-generated when they are stale? We should also be able to make performance specifications like saying that (any >> make >> make) extends the trace of (any >> make) with just file stats/reads and no writes. This should correspond to saying that make should be able to find out quickly that there is nothing to be done in a built tree just by checking timestamps and cached deps. We've not actually implemented checking if actions need to run or if they are already up to date with respect to their dependencies. This can be integrated into a smart Rule constructor that wraps the action with checks on the timestamps of the target and depends. It probably does not need to be integrated into the core make algorithm, it can just blindly run actions (though if we want to allow continuous builds we will have to revisit this). Again we ought to be able to construct specifications that check that we do not needlessly run actions but only do them when targets are out of date with respect to their dependencies. We should construct well known tricky examples, like search path shadowing problems and make sure our specifications catch such bugs. We currently have not looked at search paths. This makes things harder and probably requires a distinction between resolved and unresolved Targets. We currently do not cope with extra targets that we discover dynamically, like _stub.c files that pop out of compilation. We will have to track these as we at least have to link _stub.o files and in the case of c2hs producing .c files we may have to do further actions to compile them too. Then there's .hi boot files and what to do when we detect cycles. Lots of fun. So we'd appreciate comments, review and especially help with specifying, building and verifying a more realistic model before we move to an implementation using IO and invoking real programs. darcs get http://haskell.org/~duncan/cabal/dep-experiment/ patches gladly accepted. Of course there are existing systems which solve similar problems and people have had goes already at implementing dep analysis directly in Cabal. We don't want to ignore existing stuff but it's important we have something we can all understand and specify and verify. And extend. It's a tall order. Duncan

Duncan and others,

I have some high-level thoughts:

= Standalone Tool =

Something I've talked about for a long time (and a few times tried to get someone to implement) is a kind of EDSL for Setup files so that people can replace Make more easily by using the Setup files, and this seems like it could be a good start. You could use Parsec as a model for building a little language for dependencies and actions for those dependencies. Make is a mess because it's really hard to build any kind of abstractions using it; the main thing it does for you is the Make algorithm you've shown here, and everything else is really hard. Haskell would be much better at this :)

Cabal is getting more and more complex, and I think it would be good to consider how you might separate this from Cabal. Perhaps you could implement a make-a-like library that's pretty standalone, but would be designed for use by Cabal, and indeed, Cabal would probably use it internally, but it could also be used outside of the context of Cabal.

= Pros & Cons of Cabal Complexity =

I've tended to resist adding complexity like RPMs to Cabal because I tend to think that smaller, standalone tools that can be elegantly composed are better. The good part of composable standalone tools is that you can spread out the development effort among folks who are experts in their smaller tool. You also can keep Cabal from being so complicated that it's hard to bootstrap packages, and so that the changes between versions aren't as painful.

The downside of keeping Cabal simple is that you'd sorta like it to be the case that Cabal is all you need to build any package. As packages and dependencies among packages get more complex (in part, because Cabal facilitates that, which is good) it's natural to make Cabal more complex as well to accommodate more and more needs of package authors.

My only point, really, is that you think about these trade-offs whenever adding new capabilities to Cabal. Would it be better to have a standalone tool that Cabal depends on?

= Goals of Dependency Chasing =

I've been a bit out of the loop on this, so I apologize if I'm re-covering old ground.

A simpler goal of a module-chasing system is just to replace the need for other-modules. Other-modules is used in compilation (although ghc --make doesn't really need it) and for haddock and for making source tarballs, and probably for other things.

Ideally, the user should be able to specify only "Exposed modules" for libraries and "main modules" for executables, and Cabal should derive the other-modules itself. That's less pain for the user, and you can get around using ghc --make. Well, using ghc --make isn't much of a problem, but other compilers without a --make flag could then be made to work. Maybe this is implemented already?

At first this sounds really easy, but then I wonder, how can Cabal know whether a particular module in an import list is a part of a dependency, or actually a part of the package? Can Cabal query ghc-pkg for that kind of information? Can Cabal assume that if it can find the module in the local directory tree that it is a part of the current package? Can you think of another way to make this work?

= Overall =

Overall, this sounds really cool, and I encourage you all to keep at it! I trust you to decide if adding this kind of complexity to Cabal is worthwhile, and to build a really nice system.
I really do think that Haskell should be able to do better than Make here, and it would be a pleasure to feel like we can replace bits of that toolchain :) I'll try to read in more detail and give more feedback. peace, isaac

On Fri, 2007-10-26 at 10:20 -0700, Isaac Potoczny-Jones wrote:
Duncan and others,
I have some high-level thoughts:
= Standalone Tool =
Something I've talked about for a long time (and a few times tried to get someone to implement) is a kind of EDSL for Setup files so that people can replace Make more easily by using the Setup files, and this seems like it could be a good start. You could use Parsec as a model for building a little language for dependencies and actions for those dependencies. Make is a mess because it's really hard to build any kind of abstractions using it; the main thing it does for you is the Make algorithm you've shown here, and everything else is really hard. Haskell would be much better at this :)
Aye, I've used make in Gtk2Hs and it's pretty good. The main thing that lets it down is the bad language, well languages. The shell language plus the limited functional language on top. That and make does not handle dynamic dependencies well. So doing make properly would be a good thing since the basic principle is sound. A Cabal configure/build EDSL for the make actions and indeed for most of Cabal would also be a good thing. Some people have given it some consideration. Doing that well should help to reduce complexity and code size in Cabal.
Cabal is getting more and more complex, and I think it would be good to consider how you might separate this from Cabal. Perhaps you could implement a make-a-like library that's pretty standalone, but would be designed for use by Cabal, and indeed, Cabal would probably use it internally, but it could also be used outside of the context of Cabal.
We do have to be careful about dependencies. And I think it might actually be good to integrate this into Cabal directly and change some of the existing imperative code in Cabal to work more in a dependency driven style. I would hope the make code would be reusable if we design it properly.
= Pros & Cons of Cabal Complexity =
I've tended to resist adding complexity like RPMs to Cabal because I tend to think that smaller, standalone tools that can be elegantly composed is better. The good part of composable standalone tools is that you can spread out the development effort among folks who are experts in their smaller tool. You also can keep Cabal from being so complicated that it's hard to bootstrap packages, and so that the changes between versions aren't as painful.
Keeping the core closer to what is needed to bootstrap is a good point. We can perhaps move some tasks to tools built on the Cabal library, like cabal-install. For example there's no need for the sdist feature to be in Cabal directly, we could move it into cabal-install, at least if we make cabal-install the primary command line UI that developers use. That way it can have more dependencies, like zlib and tar, to be able to create and unpack .tar.gz files. I've also wondered if the command line UI could be completely separated, though that could make bootstrapping too hard. It could certainly be better separated internally in the Cabal library and that would make us think more clearly about what interface Cabal the library provides, rather than Cabal as a command line tool via Setup.hs.
The downside of keeping Cabal simple is that you'd sorta like it to be the case that Cabal is all you need to build any package. As packages and dependencies among packages get more complex (in part, because Cabal facilitates that, which is good) it's natural to make Cabal more complex as well to accommodate more and more needs of package authors.
My only point, really, is that you think about these trade-offs whenever adding new capabilities to Cabal. Would it be better to have a standalone tool that Cabal depends on?
As I say, I'd go the other way and have a 'cabal' command line tool that depends on the Cabal library and provides more convenience features to developers and users. The cabal-install program is going in that direction.
= Goals of Dependency Chasing =
I've been a bit out of the loop on this, so I apologize if I'm re-covering old ground.
A simpler goal of a module-chasing system is just to replace the need for other-modules. Other-modules is used in compilation (although ghc --make doesn't really need it) and for haddock and for making source tarballs, and probably for other things.
Yes. Note that although ghc --make does not need it, if you miss out any module from other-modules you'll get nasty link errors later. Much later, in a rather confusing way. So that'd be another thing we could fix.
Ideally, the user should be able to specify only "Exposed modules" for libraries and "main modules" for executables, and Cabal should derive the other-modules itself. That's less pain for the user, and you can get around using ghc --make. Well, using ghc --make isn't much of a problem, but other compilers without a --make flag could then be made to work. Maybe this is implemented already?
Dependency chasing is essential to support pre-processors properly, like c2hs, and also for compilers that do not implement their own dependency chasing. It's also important for being able to do parallel builds which will be increasingly important. I also think it'll make it easier for us to support building a collection of related packages in one go.
At first this sounds really easy, but then I wonder, how can Cabal know whether a particular module in an import list is a part of a dependency, or actually a part of the package? Can Cabal query ghc-pkg for that kind of information?
Yes.
Can Cabal assume that if it can find the module in the local directory tree that it is a part of the current package?
Yes. That's ghc's behaviour that local modules shadow package modules.
Can you think of another way to make this work?
We could do with making getting the information out of ghc-pkg more efficient by allowing us to get all the information in one call. Simon suggested something like ghc-pkg dump that just gives us all the info on all packages in one go.
= Overall =
Overall, this sounds really cool, and I encourage you all to keep at it! I trust you to decide if adding this kind of complexity to Cabal is worthwhile, and to build a really nice system. I really do think that Haskell should be able to do better than Make here, and it would be a pleasure to feel like we can replace bits of that toolchain :)
:-)
I'll try to read in more detail and give more feedback.
Great. Duncan

I've also wondered if the command line UI could be completely separated, though that could make bootstrapping too hard.

You could distribute a "bootstrapping" version of cabal shipping with the command line UI as hidden modules. If you build cabal later on you could just ship it without the files for the command line UI, then the external command line UI would be used. Of course the local bootstrapping copy must be updated when the command line UI lib is changed. Not sure whether this makes much sense then.
Marc

On Fri, 2007-10-26 at 10:48 +0000, Duncan Coutts wrote:
So what's next...
We want to write some specifications, both of what make should do and some internal invariants about the graph.
A smattering of informal descriptions of properties:

"globally up to date"

Globally, the state after make has run should be such that every target is up to date with respect to its dependencies. Thomas suggests that a good way to check this, without relying on the internals of the make algorithm or types, is to generate a tree, write out corresponding files (in the pure Make monad) and supply a gen function based on the generated tree. Then we can check after make if all dependencies are up to date. So the point is that we generate and keep a simple external dep tree (graph? dag?) and check make against that.

"only rebuild what's necessary"

A similar property is to say that after a build and touching a few files, we should only rebuild the bits that are necessary. For every action run, before the action is run there must exist a target and a dependency that are out of date with respect to each other. So actually this is expressed as a local property, not on the state after make has run. Perhaps we should express the "globally up to date" property in a local style. It might make pin-pointing offenders easier. Or do both.

"no untracked dependencies"

To a first approximation, an action should only read files that are listed as dependencies and only write files that are listed as targets. A better approximation would allow creating and deleting temporary files (files that did not previously exist and are not targets or dependencies in the graph), reading files that are indirect dependencies (think .hi files) and reading some "global constant system files" like libc.so, locale definitions etc. We can check this property if we record the beginning and end of actions in the Trace and include the complete list of dependencies and targets in the trace event (including dynamic dependencies and targets). Note that this property is something we really can check in a real build (at least on linux) using strace to get a trace of system calls.

"only necessary dependencies"

The opposite property is that there are no unnecessary dependencies or targets, that is listed dependencies/targets that are never read/written. It's not clear this is so important or will always be true. It is probably possible to have conditionals such that files are used or not depending on something. So if we go for this one we'd probably need per-rule exception lists/properties.

Duncan
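To make the first of those a bit more concrete, here is a rough sketch of the kind of check the "globally up to date" property needs, assuming we keep the external dependency relation as a plain list of (target, dependency) pairs (the names here are illustrative, not from the repo):

allUpToDate :: [(FilePath, FilePath)] -> FileSystem -> Bool
allUpToDate deps fs = all upToDate deps
  where
    upToDate (target, dep) =
      case (Map.lookup target fs, Map.lookup dep fs) of
        (Just ft, Just fd) -> timestamp ft >= timestamp fd
        _                  -> False   -- a missing file is never up to date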

On Fri, 2007-10-26 at 10:48 +0000, Duncan Coutts wrote:
Note that this is where we have the opportunity for parallelism as we can pick all the rules with no dependencies. For the moment we will just pick one rule.
Speaking of parallel builds...

Spencer Janssen, Lennart Kolmodin and I were discussing possible approaches to integrate parallel builds into this framework. Some requirements we thought of were that it should fit with the approach of being agnostic/parametrised by the underlying build monad. It should not rely on any actual parallelism happening, only on there being parallelism available for the underlying build monad to exploit. We'd like something simpler than forkIO + MVars. We'd like to be able to implement simple single threaded versions easily. We'd like to be able to test/simulate different schedulers in the pure context to be sure that we have no bugs that depend on the order of parallel jobs. So it should be possible for us to implement demonic job schedulers for testing.

Spencer suggested an api based on launching jobs and collecting completed jobs. So a JobControl m a gives us a way to launch jobs in a monad 'm' that have result type 'a':

data JobControl m a = JobControl {
    launch  :: m a -> m (),
    collect :: m a
  }

We'd then pass one of these as a parameter to make, just like we already pass a DepGenerator. Just like the DepGenerator, its actions will be in the same monad as we're working in overall.

make :: Monad m => DepGenerator m -> JobControl m (Something) -> Graph m -> m ()

For make we'd be using some specific type for the jobs. It'd probably be something like the result of Rule actions, paired with enough info to identify which rule it is that was completed.

Informally, the semantics of launch / collect would be... We use launch to create a new job. We use collect to get the result of *any* launched job. Launched jobs may complete in any order. Collect when there are no launched jobs is an error (if this proves inconvenient we could change this to have collect return Nothing when there are no jobs). If a job fails in the monad, the failure is returned in the monad when the job is collected.

So a simple serial implementation might look like:

serialJobControl :: MonadState [m a] m => JobControl m a
serialJobControl = JobControl {
    launch  = \job -> modify (job:),
    collect = do
      (job:jobs) <- get
      put jobs
      job
  }

So that just maintains a stack of jobs, adds to the stack on launch and runs a job when we collect. It's not in order, but that does not matter since any order will do.

An implementation in a monad based on IO would use an implementation using forkIO and MVars. It'd probably just fork each new launched job. It'd probably use a semaphore to control the degree of parallelism. Results of completed jobs could be submitted to a channel which collect could read from.

One thing all this depends on in practice is the ability to launch multiple child processes and wait for the first one to complete without blocking the whole process. Currently that's not something we can do with the portable System.Process module. The waitForProcess function blocks the entire process. Even with ghc's threaded rts we can only do it at the cost of one OS thread for each process we are waiting on. That does not seem like a sensible model. It really should be possible to manage a collection of child processes without using any OS level concurrency. In ghc on unix we should be able to do this by arranging for SIGCHLD signals to be sent when child processes complete.

Duncan
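For the IO-backed implementation described above, a rough sketch (not from the repo, and ignoring error propagation for brevity) using forkIO, a quantity semaphore to bound the parallelism, and a channel of completed results might look like:

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, writeChan, readChan)
import Control.Concurrent.QSem (newQSem, waitQSem, signalQSem)

data IOJobControl a = IOJobControl {
    launchIO  :: IO a -> IO (),
    collectIO :: IO a
  }

newIOJobControl :: Int -> IO (IOJobControl a)
newIOJobControl maxJobs = do
  sem     <- newQSem maxJobs   -- bounds the number of jobs running at once
  results <- newChan           -- completed job results, read by collect
  let launchJob job = do
        _ <- forkIO $ do
          waitQSem sem
          x <- job
          signalQSem sem
          writeChan results x
        return ()
  return (IOJobControl launchJob (readChan results))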

On Fri, 2007-10-26 at 10:48 +0000, Duncan Coutts wrote:
So what's next...
We want to write some specifications
Spencer, Lennart, Thomas and I had a joint hacking session in which we made some progress on this issue today. Specifically we can now generate random dep graphs for use in QuickCheck tests. For example:

http://haskell.org/~duncan/cabal/foo.svg

We can tune the size and 'density' of these graphs. As you can see we get cycles in the graphs. We'll have to filter out cycles. Any suggestions on a simple way to do that? It's ok to depend on Data.Graph for this kind of test code.

So the first testing strategy we're going for is to generate random dep graphs. From there we will write out a suitable selection of the files and run the make algorithm. We then inspect the resulting event trace and final state to check that everything worked correctly. So we'll be making the dependency generator from this random dep graph too.

So this is for testing make in isolation. For the next bit of testing our rule sets for building Haskell projects we'll have to generate dep graphs that use files mentioned in our rule sets. That's probably a bit harder.

We're using a trivial representation of the random dep graphs:
newtype TestDepGraph = TestDepGraph [Dep]
data Dep = Target :<= Target deriving (Eq, Show)
foo :<= bar means that we generate foo from bar. So an example property might look something like:
prop_makeUpToDate :: Property
prop_makeUpToDate =
  forall $ \graph ->
    let build = do mapM_ touch (files graph)
                   make (gen graph)
    in allUpToDate (finalState build)
As I mentioned, the gen function will have to do its rule generation based on the random graph we generated. The new code will be in the repo shortly...

darcs get http://haskell.org/~duncan/cabal/dep-experiment/

Duncan

On Tue, Oct 30, 2007 at 10:45:34PM +0000, Duncan Coutts wrote:
On Fri, 2007-10-26 at 10:48 +0000, Duncan Coutts wrote:
So what's next...
We want to write some specifications
Spencer, Lennart, Thomas and I had a joint hacking session in which we made some progress on this issue today. Specifically we can now generate random dep graphs for use in QuickCheck tests. For example:
http://haskell.org/~duncan/cabal/foo.svg
We can tune the size and 'density' of these graphs. As you can see we get cycles in the graphs. We'll have to filter out cycles. Any suggestions on a simple way to do that? It's ok to depend on Data.Graph for this kind of test code.
It is a well known result in the theory of graphs that a directed graph is acyclic exactly when it admits a topological ordering. Thus, you could simply arrange to only generate edges from nodes to previously-created nodes. Although this would change the distribution of shapes somewhat. Stefan

On Tue, 2007-10-30 at 16:04 -0700, Stefan O'Rear wrote:
On Tue, Oct 30, 2007 at 10:45:34PM +0000, Duncan Coutts wrote:
On Fri, 2007-10-26 at 10:48 +0000, Duncan Coutts wrote:
So what's next...
We want to write some specifications
Spencer, Lennart, Thomas and I had a joint hacking session in which we made some progress on this issue today. Specifically we can now generate random dep graphs for use in QuickCheck tests. For example:
http://haskell.org/~duncan/cabal/foo.svg
We can tune the size and 'density' of these graphs. As you can see we get cycles in the graphs. We'll have to filter out cycles. Any suggestions on a simple way to do that? It's ok to depend on Data.Graph for this kind of test code.
It is a well known result in the theory of graphs that a directed graph is acyclic exactly when it admits a topological ordering. Thus, you could simply arrange to only generate edges from nodes to previously-created nodes. Although this would change the distribution of shapes somewhat.
Wouldn't that mean we can only get trees (instead of DAGs)?

On Wed, Oct 31, 2007 at 12:23:37AM +0100, Thomas Schilling wrote:
On Tue, 2007-10-30 at 16:04 -0700, Stefan O'Rear wrote:
On Tue, Oct 30, 2007 at 10:45:34PM +0000, Duncan Coutts wrote:
On Fri, 2007-10-26 at 10:48 +0000, Duncan Coutts wrote:
So what's next...
We want to write some specifications
Spencer, Lennart, Thomas and I had a joint hacking session in which we made some progress on this issue today. Specifically we can now generate random dep graphs for use in QuickCheck tests. For example:
http://haskell.org/~duncan/cabal/foo.svg
We can tune the size and 'density' of these graphs. As you can see we get cycles in the graphs. We'll have to filter out cycles. Any suggestions on a simple way to do that? It's ok to depend on Data.Graph for this kind of test code.
It is a well known result in the theory of graphs that a directed graph is acyclic exactly when it admits a topological ordering. Thus, you could simply arrange to only generate edges from nodes to previously-created nodes. Although this would change the distribution of shapes somewhat.
Wouldn't that mean we can only get trees (instead of DAGs)?
No. (Nodes 1 and 2 with (1,2) and (1,2) as edges, for example.) Stefan
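Taking Stefan's suggestion literally, a generator might look roughly like this sketch (made-up names; sublistOf is from more recent QuickCheck versions). Node i only ever points at nodes created before it, so the result is acyclic by construction, but a node can be depended on by several later nodes, so we still get DAGs rather than just trees:

import Control.Monad (forM)
import Test.QuickCheck (Gen, sublistOf)

genAcyclicDeps :: Int -> Gen [Dep]
genAcyclicDeps n =
  fmap concat $ forM [2..n] $ \i -> do
    earlier <- sublistOf [1 .. i-1]      -- node i may only depend on older nodes
    return [ node i :<= node j | j <- earlier ]
  where
    node k = "t" ++ show k               -- Targets are just FilePaths here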

On 10/30/07, Duncan Coutts
On Fri, 2007-10-26 at 10:48 +0000, Duncan Coutts wrote:
So what's next...
We want to write some specifications
Spencer, Lennart, Thomas and I had a joint hacking session in which we made some progress on this issue today. Specifically we can now generate random dep graphs for use in QuickCheck tests. For example:
Very nice :)
We can tune the size and 'density' of these graphs. As you can see we get cycles in the graphs. We'll have to filter out cycles. Any suggestions on a simple way to do that? It's ok to depend on Data.Graph for this kind of test code.
I noticed that. You could do a DFS and prune the back edges.
So the first testing strategy we're going for is to generate random dep graphs.
I'm thinking it may be possible to make a finite base set of test cases that represent all the possibilities, but I'm not sure about that.
From there we will write out a suitable selection of the files and run the make algorithm. We then inspect the resulting event trace and final state to check that everything worked correctly. So we'll be making the dependency generator from this random dep graph too.
So this is for testing make in isolation. For the next bit of testing our rule sets for building Haskell projects we'll have to generate dep graphs that use files mentioned in our rule sets. That's probably a bit harder.
I have a good idea about what's necessary here. Ideally, GHC could give us a list of imports for a given source file. (That way we don't have to do ad-hoc parsing.) The same thing goes for C2HS. I guess actually you might not be thinking that far ahead yet :) If you have anything specific in mind I can work on, feel free to let me know. Pete

On Tue, 2007-10-30 at 19:31 -0400, Peter Gavin wrote:
On 10/30/07, Duncan Coutts
wrote:
Very nice :)
Yeah, we were pretty pleased :-)
As you can see we get cycles in the graphs. We'll have to filter out cycles. Any suggestions on a simple way to do that?
I noticed that. You could do a DFS and prune the back edges.
Would that be minimal? We'd only like to delete something near the minimum number of edges to make the cyclic graph into a DAG.
So the first testing strategy we're going for is to generate random dep graphs.
I'm thinking it may be possible to make a finite base set of test cases that represent all the possibilities, but I'm not sure about that.
For testing our real rule sets I was thinking we should generate some test cases from real projects, eg Cabal's own dep graph.
From there we will write out a suitable selection of the files and run the make algorithm. We then inspect the resulting event trace and final state to check that everything worked correctly. So we'll be making the dependency generator from this random dep graph too.
So this is for testing make in isolation. For the next bit of testing our rule sets for building Haskell projects we'll have to generate dep graphs that use files mentioned in our rule sets. That's probably a bit harder.
I have a good idea about what's necessary here. Ideally, GHC could give us a list of imports for a given source file. (That way we don't have to do ad-hoc parsing.)
See ghc -M in the ghc user guide. Parsing that should be trivial.
The same thing goes for C2HS. I guess actually you might not be thinking that far ahead yet :)
If you have anything specific in mind I can work on, feel free to let me know.
Probably the next thing we want are some small but more real test cases that will provide milestones in the development of rule sets. For example we'd start with simple sets of .hs files. But next we'd want examples using cpp, .hi boot files, .hs files that produce _stub.c files as dynamic targets, c2hs etc. These should be examples small enough to work with and debug but should also be sufficiently realistic. The next step after that would be some larger test cases. These would make great regression tests.

Maybe you'd like to look at generating test cases from real projects. That is, generating sequences of actions in the Make monad to set up the virtual file system to a state representing the real project.

For both these suggestions, generating small or realistic examples, we need to consider the representation of import style dependencies in the presence of pre-processors. Lennart made the suggestion that we should represent the content of our virtual files as a list of sets of dependencies where each stage of pre-processing strips off a layer. For example consider a Foo.chs file:

{# import Bar #}
import Baz

The Foo.hs and Foo.chi files depend on Bar.chi, while Foo.o and Foo.hi depend on Bar.hi and Baz.hi. We'd therefore represent the file content as:

[ ["Bar"]
, ["Bar", "Baz"] ]

In your patch you were dealing with the real rule sets required to build stuff with ghc and c2hs. Perhaps you could look at the rule schema and dependency generation side of things. Currently in our model we hide all that behind a dependency generator function. Eventually of course we'll want some nice composable way of building a dependency generator function, probably from some representation of rule schemas.

Lots to do :-)

Duncan
participants (6)
- Duncan Coutts
- Isaac Potoczny-Jones
- Marc Weber
- Peter Gavin
- Stefan O'Rear
- Thomas Schilling