Graph theory analysis of Haskell code

This isn't strictly Haskell related, but anyway. Next year I will be doing my honours in mathematics. One possible topic for my thesis that I've thought of - and my supervisor is quite enthused about - is to use graph theory to analyse various textual sources, starting with source code but leaving the framework open enough to be able to extend it to other sources (e.g. email address books).

How I envisage it happening is that a parser would be used to find all "functions" in the given code, treat these as nodes in the graph and then use directed edges to indicate which functions call other functions. This resultant graph can then be analysed in various ways suitable to the context (e.g. finding that a library module can be split in two since there are two completely separate trees present in the graph that don't interact at all, or that a function which is only ever called by one other function can be subsumed into its caller).

So, here is the question I ask of all of you: is this feasible? Do you know if anything like this has ever been attempted before? I know there are some other uses of graph theory related to source code (e.g. McCabe's complexity metric [1]), but I couldn't seem to find anything related to what I'm proposing. I intend to code this up in Haskell (possibly using FGL: I know of it, but haven't really looked at it) and use Haskell as my primary target for analysis, so in a sense the resultant graph could be seen as a Haskell equivalent to UML.

[1] http://en.wikipedia.org/wiki/Cyclomatic_complexity

-- Ivan Lazar Miljenovic
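As a very rough illustration of the kind of analysis described above, here is a minimal FGL sketch, assuming a parser has already extracted the function names and call pairs; the example data and helper names are invented for illustration, not part of any existing tool:

    import Data.Graph.Inductive.Graph (mkGraph, labNodes, lab, indeg)
    import Data.Graph.Inductive.PatriciaTree (Gr)
    import Data.Graph.Inductive.Query.DFS (components)
    import Data.Maybe (mapMaybe)

    -- A call graph: nodes carry function names, directed edges mean "calls".
    type CallGraph = Gr String ()

    -- Toy input, standing in for what a real parser would produce.
    example :: CallGraph
    example = mkGraph (zip [0..] ["main", "parse", "render", "helper"])
                      [(0, 1, ()), (0, 2, ()), (2, 3, ())]

    -- Connected components that never interact: candidates for splitting
    -- a module in two.
    splitCandidates :: CallGraph -> [[String]]
    splitCandidates g = map (mapMaybe (lab g)) (components g)

    -- Functions with exactly one caller: candidates for being subsumed
    -- into their sole call site.
    singleCaller :: CallGraph -> [String]
    singleCaller g = [ name | (n, name) <- labNodes g, indeg g n == 1 ]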

On 12/5/07, Ivan Miljenovic wrote:
How I envisage it happening is that a parser would be used to find all "functions" in the given code, treat these as nodes in the graph and then use directed edges to indicate which functions call other functions.
aka a "call graph". This is called "control flow analysis" and the classic paper on it is Olin Shivers' dissertation, "Control Flow Analysis of Higher Order Languages" (http://repository.readscheme.org/ftp/papers/shivers-diss.ps.gz ).
This resultant graph can then be analysed in various ways suitable to the context (e.g. find that a library module can be split into two since there are two completely separate trees present in the graph that don't interact at all, or if a function is only ever called by one other function then it can be subsumed into it).
One example of an analysis like this is done by GHC's inliner. See the paper "Secrets of the Glasgow Haskell Compiler Inliner" by Peyton Jones and Marlow (http://research.microsoft.com/~simonpj/Papers/inlining/).
So, here is the question I ask of all of you: is this feasible? Do you know if anything like this has ever been attempted before? I know there are some other uses of graph theory related to source code (e.g. McCabe's complexity metric [1]), but I couldn't seem to find anything related to what I'm proposing. I intend to code this up in Haskell (possibly using FGL: I know of it, but haven't really looked at it) and use Haskell as my primary target for analysis, so in a sense the resultant graph could be seen as a Haskell equivalent to UML.
This is very well-trodden ground, but if you familiarize yourself with the literature on the subject, then who knows, you may discover something new. And you can take pleasure in knowing that you've already independently conceived of an idea that lots of smart people have seen fit to put a lot of time into :-) Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "It's true I don't want to join the Army or turn lathes in precision parts factories, I'm nearsighted and psychopathic anyway. America I'm putting my queer shoulder to the wheel." -- Allen Ginsberg

On 06/12/2007, Tim Chevalier wrote:
This is very well-trodden ground, but if you familiarize yourself with the literature on the subject, then who knows, you may discover something new. And you can take pleasure in knowing that you've already independently conceived of an idea that lots of smart people have seen fit to put a lot of time into :-)
\o/ Yay! :p -- Ivan Lazar Miljenovic

"Tim Chevalier"
aka a "call graph". This is called "control flow analysis" and the classic paper on it is Olin Shivers' dissertation
This is very well-trodden ground, but if you familiarize yourself with the literature on the subject, then who knows, you may discover something new.
I'll just add that having a tool visualizing this would be very useful for refactoring code. If you e.g. use color to distinguish nodes/functions from different modules, you could use that information to decide to merge or split modules to minimize external interfaces. You could also try to automatically cluster nodes, which would be more interesting theoretically, but IMO less likely to be practical. Another option would be to couple this with profiling or coverage information to visualize something about the usage of paths and nodes in the call graph. -k -- If I haven't seen further, it is by standing in the footprints of giants
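To make the colour-by-module suggestion concrete, here is a small sketch that simply emits GraphViz dot text by hand; the Fn type, the palette and the sample layout are all made up for illustration:

    import Data.List (nub)

    -- A function is identified by (module, name); an edge is a call.
    type Fn = (String, String)

    toDot :: [Fn] -> [(Fn, Fn)] -> String
    toDot fns calls = unlines $
        ["digraph callgraph {"]
        ++ map node fns
        ++ [ "  " ++ quote a ++ " -> " ++ quote b ++ ";" | (a, b) <- calls ]
        ++ ["}"]
      where
        -- one fill colour per module, recycled if there are many modules
        palette = cycle ["lightblue", "palegreen", "lightsalmon", "khaki"]
        colours = zip (nub (map fst fns)) palette
        node f@(m, _) = "  " ++ quote f ++ " [style=filled, fillcolor="
                        ++ maybe "white" id (lookup m colours) ++ "];"
        quote (m, f) = show (m ++ "." ++ f)

Piping the output through something like dot -Tpng would then give the coloured call graph picture.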

On Thu, Dec 06, 2007 at 09:34:30AM +0100, Ketil Malde wrote:
"Tim Chevalier"
writes: This is very well-trodden ground, but if you familiarize yourself with the literature on the subject, then who knows, you may discover something new.
I'll just add that having a tool visualizing this would be very useful for refactoring code. If you e.g. use color to distinguish nodes/functions from different modules, you could use that information to decide to merge or split modules to minimize external interfaces.
You could also try to automatically cluster nodes, which would be more interesting theoretically, but IMO less likely to be practical.
Another option would be to couple this with profiling or coverage information to visualize something about the usage of paths and nodes in the call graph.
Indeed, a visualization tool like this would be cool! -- David Roundy Department of Physics Oregon State University

This sounds like a fun project and it is certainly feasible to do. I
thought I'd give you some pointers to fun stuff that people have been
doing in the past.
Thomas Reps has been doing program analysis since the dawn of time, but one paper that seems particularly related to what you're trying to do is "Identifying modules via concept analysis". You can find that and the rest of his papers on his homepage:
http://pages.cs.wisc.edu/~reps/
There are many different characteristics of a program graph that can be of interest. One that has recently given rise to some interesting results is the notion of tree width. An example of its use is the following paper:
http://citeseer.ist.psu.edu/601409.html
Have fun,
Josef
On Dec 6, 2007 12:46 AM, Ivan Miljenovic wrote:
This isn't strictly Haskell related, but anyway.
Next year I will be doing my honours in mathematics. One possible topic for my thesis that I've thought of - and my supervisor is quite enthused about - is to use graph theory to analyse various textual sources, starting with source code but leaving the framework open enough to be able to extend it to other sources (e.g. email address books).
How I envisage it happening is that a parser would be used to find all "functions" in the given code, treat these as nodes in the graph and then use directed edges to indicate which functions call other functions. This resultant graph can then be analysed in various ways suitable to the context (e.g. find that a library module can be split into two since there are two completely separate trees present in the graph that don't interact at all, or if a function is only ever called by one other function then it can be subsumed into it).
So, here is the question I ask of all of you: is this feasible? Do you know if anything like this has ever been attempted before? I know there are some other uses of graph theory related to source code (e.g. McCabe's complexity metric [1]), but I couldn't seem to find anything related to what I'm proposing. I intend to code this up in Haskell (possibly using FGL: I know of it, but haven't really looked at it) and use Haskell as my primary target for analysis, so in a sense the resultant graph could be seen as a Haskell equivalent to UML.
[1] http://en.wikipedia.org/wiki/Cyclomatic_complexity
-- Ivan Lazar Miljenovic

Ivan Miljenovic wrote:
How I envisage it happening is that a parser would be used to find all "functions" in the given code, treat these as nodes in the graph and then use directed edges to indicate which functions call other functions. This resultant graph can then be analysed in various ways suitable to the context (e.g. find that a library module can be split into two since there are two completely separate trees present in the graph that don't interact at all, or if a function is only ever called by one other function then it can be subsumed into it).
It just occurred to me that this idea is more general than the control or data flow analysis that are the focus of similar ideas I've seen before. For example, you could trace type usage through the code (which would likely be a subset of the control flow for Haskell, but an interesting subset nonetheless). :-) -- Tommy M. McGuire mcguire@crsr.net
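A sketch of how the type-usage variant might start, assuming the parser can report, for each function, the type constructors appearing in its signature; the data and names below are invented:

    import qualified Data.Map as Map
    import Data.Map (Map)

    -- Hypothetical parser output: each function with the types it mentions.
    typeUsage :: [(String, [String])]
    typeUsage =
      [ ("parseConfig", ["FilePath", "Config"])
      , ("render",      ["Config", "Doc"])
      , ("save",        ["Doc", "FilePath"])
      ]

    -- Invert the relation: for each type, the functions that mention it.
    -- Functions sharing a type become neighbours in a "type usage" graph.
    usedBy :: [(String, [String])] -> Map String [String]
    usedBy usage = Map.fromListWith (++)
      [ (ty, [fn]) | (fn, tys) <- usage, ty <- tys ]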

On 07/12/2007, Tommy McGuire wrote:
How I envisage it happening is that a parser would be used to find all "functions" in the given code, treat these as nodes in the graph and then use directed edges to indicate which functions call other functions. This resultant graph can then be analysed in various ways suitable to the context (e.g. find that a library module can be split into two since there are two completely separate trees present in the graph that don't interact at all, or if a function is only ever called by one other function then it can be subsumed into it).
It just occurred to me that this idea is more general than the control or data flow analysis that are the focus of similar ideas I've seen before. For example, you could trace type usage through the code (which would likely be a subset of the control flow for Haskell, but an interesting subset nonetheless). :-)
While I'd like to do this, for the purposes of ensuring that the project contains enough mathematics (and to keep my supervisor happy, since he is skeptical about why I keep bringing Haskell into everything, even though my Haskell re-implementation of a program he wrote in C generates more/better answers and we _still_ can't find where the bugs in his code are, but that's another story...), I'm not sure if I can delve too deeply into Haskell specifics. Of course, if I can somehow turn the type usage into nodes on the graph, then that should be all right! :D
On 06/12/2007, Ketil Malde wrote:
I'll just add that having a tool visualizing this would be very useful for refactoring code. If you e.g. use color to distinguish nodes/functions from different modules, you could use that information to decide to merge or split modules to minimize external interfaces.
You could also try to automatically cluster nodes, which would be more interesting theoretically, but IMO less likely to be practical.
Another option would be to couple this with profiling or coverage information to visualize something about the usage of paths and nodes in the call graph.
This is partially what I was hoping to do. I do know that my supervisor was interested in examining C code and attaching cost information to the nodes (using some Unix profiling tool whose name he couldn't remember), but I'm not sure how I'd go about integrating compilation and profiling into such a program. -- Ivan Lazar Miljenovic
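One way the cost-annotation step might look, assuming the call graph is an FGL graph labelled with function names and the profiler's per-function costs have already been parsed into a Map elsewhere (both are assumptions, and annotate is an invented name):

    import qualified Data.Map as Map
    import Data.Map (Map)
    import Data.Graph.Inductive.Graph (nmap)
    import Data.Graph.Inductive.PatriciaTree (Gr)

    -- Cost per function, e.g. %time figures taken from a profiler report.
    type Costs = Map String Double

    -- Re-label every node with its cost (0 for functions the profiler
    -- never saw), leaving the call edges untouched.
    annotate :: Costs -> Gr String () -> Gr (String, Double) ()
    annotate costs = nmap (\name -> (name, Map.findWithDefault 0 name costs))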

Ivan Miljenovic wrote:
On 07/12/2007, Tommy McGuire wrote:
It just occurred to me that this idea is more general than the control or data flow analysis that are the focus of similar ideas I've seen before. For example, you could trace type usage through the code (which would likely be a subset of the control flow for Haskell, but an interesting subset nonetheless). :-)
While I'd like to do this, for the purposes of ensuring that the project contains enough mathematics (and to keep my supervisor happy, since he is skeptical about why I keep bringing Haskell into everything, even though I've re-implemented in Haskell a program that he wrote in C which generated more/better answers, and we _still_ can't find where the bugs in his code are, but that's another story...), I'm not sure if I can delve too deeply into Haskell-specifics. Of course, if I can somehow turn the type usage into nodes on the graph, then that should be all right! :D
I was actually thinking that something like that would be more valuable for a language like C, where types are not represented in the control flow. By the way, in a completely different context I just ran across a couple of references: "Concrete Architecture of the Linux Kernel". Ivan T. Bowman, Saheem Siddiqi, and Meyer C. Tanuan. "Conceptual Architecture of the Linux Kernel", Ivan T. Bowman. (The first is available on CiteSeer; I haven't found the second.)
On 06/12/2007, Ketil Malde wrote:
I'll just add that having a tool visualizing this would be very useful for refactoring code. If you e.g. use color to distinguish nodes/functions from different modules, you could use that information to decide to merge or split modules to minimize external interfaces.
You could also try to automatically cluster nodes, which would be more interesting theoretically, but IMO less likely to be practical.
Another option would be to couple this with profiling or coverage information to visualize something about the usage of paths and nodes in the call graph.
This is partially what I was hoping to do. I do know that my supervisor was interested in examining C code and attaching cost information to the nodes (using some Unix profiling tool which he couldn't remember the name of),
prof and gprof?
but I'm not sure how I'd go about adding compilation and profiling into such a program.
-- Tommy M. McGuire mcguire@crsr.net

On 07/12/2007, Tommy McGuire wrote:
I was actually thinking that something like that would be more valuable for a language like C, where types are not represented in the control flow.
By the way, in a completely different context I just ran across a couple of references:
"Concrete Architecture of the Linux Kernel". Ivan T. Bowman, Saheem Siddiqi, and Meyer C. Tanuan.
"Conceptual Architecture of the Linux Kernel", Ivan T. Bowman.
(The first is available on CiteSeer; I haven't found the second.)
OK, thanks.
prof and gprof?
That looks like them. Of course, we Haskellers have no need of such things, since our programs have profiling built into the RTS (at least when using GHC, anyway)! -- Ivan Lazar Miljenovic

On Fri, Dec 07, 2007 at 10:16:43AM +1000, Ivan Miljenovic wrote:
On 07/12/2007, Tommy McGuire wrote:
I was actually thinking that something like that would be more valuable for a language like C, where types are not represented in the control flow.
By the way, in a completely different context I just ran across a couple of references:
"Concrete Architecture of the Linux Kernel". Ivan T. Bowman, Saheem Siddiqi, and Meyer C. Tanuan.
"Conceptual Architecture of the Linux Kernel", Ivan T. Bowman.
(The first is available on CiteSeer; I haven't found the second.)
OK, thanks.
prof and gprof?
That looks like them. Of course, us Haskellers have no need of such things, since our programs have profiling inbuilt into the RTS (at least when using GHC, anyway)!
prof derives most of its utility from special hooks baked into gcc and libc. So gcc/prof is not really that different from ghc/hp2ps. Stefan
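For reference, the GHC side of that parallel looks roughly like this (flags from the GHC 6.x era; treat it as a sketch and check the User's Guide for the current spelling):

    $ ghc --make -prof -auto-all Main.hs   # add cost centres to all top-level bindings
    $ ./Main +RTS -p -RTS                  # time/allocation profile, written to Main.prof
    $ ./Main +RTS -hc -RTS                 # heap profile by cost centre, written to Main.hp
    $ hp2ps Main.hp                        # render the heap profile to Main.ps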
participants (7): David Roundy, Ivan Miljenovic, Josef Svenningsson, Ketil Malde, Stefan O'Rear, Tim Chevalier, Tommy McGuire