Analysing Haskell with Graph Theory

Back in December, I posted that I was thinking about doing a project for my honours year on using graph theory for analysis on Haskell source code, amongst others: http://thread.gmane.org/gmane.comp.lang.haskell.cafe/32912 Well, I've officially started this project, and my supervisor and I have come up with the following functions that would probably be relevant to analysing _any_ piece of source code (the actual code will be tentatively stored as a directed graph with the nodes being functions and the edges being function calls, i.e. if f calls g, there is a directed edge from f to g): * root finding -> find nodes with no incoming edges (in a program, you'd ideally want only main to be such a node) * cycle detection -> find a possibly recursive cycle in your code: if the cycle is too big, then you may wish to consider refactoring * depth analysis -> find leaves that have a depth from the root node that is extremely large compared to the others (if a function is 50 calls down from main compared to an average of 15, you may wish to refactor) * chain detection -> find connected functions that have only one incoming and one outgoing edge, e.g. : f -> g -> h : if g isn't used anywhere else, you may wish to combine it inside either f or g * node popularity -> get a count of how many functions use a particular function and how many other functions it calls (related to chain detection above) * clique detection -> probably not as relevant to source code, but if you have a large number of functions that are all pairwise co-recursive than you may wish to refactor Can anyone think of any other kind of functions that would be useful in this kind of source code analysis? Also, I'm going to be using haskell-src for for the parsing of the Haskell code so that I don't waste my time writing a parser, when my project is technically on the graph-theory side of things. I know that it focuses mainly on H98 with only a few extensions... is this likely to be a problem if I want to try running my eventual program on say ghc or xmonad as test examples? -- Ivan Lazar Miljenovic

be tentatively stored as a directed graph with the nodes being functions and the edges being function calls, i.e. if f calls g, there is a directed edge from f to g):
Will the graph be completely computed before the analysis begins? Or will you try to build the graph lazily as and when you require more information? And
* root finding -> find nodes with no incoming edges (in a program, you'd ideally want only main to be such a node)
* cycle detection -> find a possibly recursive cycle in your code: if the cycle is too big, then you may wish to consider refactoring
* depth analysis -> find leaves that have a depth from the root node that is extremely large compared to the others (if a function is 50 calls down from main compared to an average of 15, you may wish to refactor)
* chain detection -> find connected functions that have only one incoming and one outgoing edge, e.g. : f -> g -> h : if g isn't used anywhere else, you may wish to combine it inside either f or g
* node popularity -> get a count of how many functions use a particular function and how many other functions it calls (related to chain detection above)
* clique detection -> probably not as relevant to source code, but if you have a large number of functions that are all pairwise co-recursive than you may wish to refactor
Can anyone think of any other kind of functions that would be useful in this kind of source code analysis?
Are you looking for Haskell functions that can be used to solve the above problems? I guess once you come up with the algorithms, translating it into Haskell shouldnt be much of a problem. -- Vimal

On 07/03/2008, Vimal
Will the graph be completely computed before the analysis begins? Or will you try to build the graph lazily as and when you require more information?
Probably beforehand.
Are you looking for Haskell functions that can be used to solve the above problems? I guess once you come up with the algorithms, translating it into Haskell shouldnt be much of a problem.
No, I'm just asking people what else they think I should be looking for. -- Ivan Lazar Miljenovic

"Ivan Miljenovic"
Can anyone think of any other kind of functions that would be useful in this kind of source code analysis?
Sometimes, it's not obvious where to draw boundaries between modules, perhaps finding a "smallest cut" (if that is the correct term) could help to minimize the interfaces? I.e. find tightly connected components that have relatively few inbound links. Sounds like an interesting project, good luck! -k -- If I haven't seen further, it is by standing in the footprints of giants

On 07/03/2008, Ketil Malde
Sometimes, it's not obvious where to draw boundaries between modules, perhaps finding a "smallest cut" (if that is the correct term) could help to minimize the interfaces? I.e. find tightly connected components that have relatively few inbound links.
As in the seeing if a module can be split into sub-modules (i.e. disjoint or nearly disjoint sub-graphs inside the module graph)? I was thinking about something like that a few days ago but had forgotten about it when discussing it with my supervisor. (It doesn't help that he's mainly a C-programmer, and doesn't really understand what I mean when I talk about modules). -- Ivan Lazar Miljenovic

On 07/03/2008, Arnar Birgisson
Will you be considering parallel programs? Also, perhaps some information flow analysis would be interesting.
What do you mean by parallel programs? The parallelism hints used by ghc? In that case, I'll be supporting whatever the parser I can find supports :p -- Ivan Lazar Miljenovic

On Fri, Mar 7, 2008 at 12:42 PM, Ivan Miljenovic
On 07/03/2008, Arnar Birgisson
wrote: Will you be considering parallel programs? Also, perhaps some information flow analysis would be interesting.
What do you mean by parallel programs? The parallelism hints used by ghc? In that case, I'll be supporting whatever the parser I can find supports :p
I mean parallel programs in general. For example, you mention that preferably one would want "main" to be the only function that has no incoming edges. With parallel programs that may not be the case. This ties into also considering reactive programs (i.e. programs which main function is of the form "while (1) { ... }"). I'm not quite sure if it fits the project you describe, but looking for certain properties like possible deadlocks or livelocks in such programs is something of interest. There is also a security angle to this. Consider two agents, one representing an ATM and one representing the branch office of a bank. Analysing the possible possible information paths between the two can help you define a clear interface between the two (assuming one doesn't exist already). Having such an interface rigorously defined helps with maintaining security properties. Also related to considering parallel programs are coroutines. Analysing the "call graph" of systems of coroutines might be interesting as well. I'm just thinking outloud though, some of these might not be interesting at all :) cheers, Arnar
participants (4)
-
Arnar Birgisson
-
Ivan Miljenovic
-
Ketil Malde
-
Vimal