Module extractor

1 Feb 2001


      Hi All,

	During the last few days I've been working on the
	ModuleExtractor - a high-level extractor of module
	information from Haskell source files. This is not a
	low-level parser of the kind used by compilers, since it
	only cares about the things related to documentation.
 
	I am using Daan Leijen's Parsec library, which seems
	to be well designed, well documented and reasonably fast.

	The motivation for this work is to replace the home-brewed
	parsing of source files (or rather, sophisticated "grepping")
	in the Haskell Module Browser, which I currently do in
	Smalltalk, with a Haskell version. I do not expect to gain
	much speed here (the Hugs implementation will probably be
	much slower than the Squeak one), but the idea is to move
	as much code as possible from the Squeak side to the Haskell
	side, in order to create support code that could benefit
	other people wishing to interface such browsers to systems
	other than Squeak.

	I think this information is relevant to our discussion:
	it could help clarify some issues and provide an
	experimental tool.

	The parser aims to extract this information from the source
	files:

	data Module = Module
		{ name       :: String     -- done
		, comment    :: String     -- done
		, exports    :: [Export]   -- chunk for now	
		, imports    :: [Import]   -- chunk for now
		, fixities   :: [Fixity]   -- done 
		, classes    :: [Class]    -- chunk for now
		, instances  :: [Instance] -- chunk for now
		, categories :: [String]   -- chunk for now
		, functions  :: [Function] -- done
		, footnote   :: String     -- done
		}

	In the first stage, the parser breaks the source code into
	chunks:
	
	type Chunk = (Comment, Code)
 	
	and then examines each chunk to convert it to one of the
	above specified entities. For example, the Function datatype
	is defined as:
	
	data Function = Function
		{ funName      :: String
		, funSignature :: Signature
		, funBody      :: String
		}
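
	As a rough illustration of the first stage (not the actual
	ModuleExtractor code, which is built on Parsec), the chunking
	step might be sketched in plain Haskell as below. The names
	isTopComment, isBlank and chunks are hypothetical, and the
	sketch assumes that only unindented "--" lines start a new
	chunk; "{- .. -}" comments are ignored.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, isPrefixOf)

type Comment = String
type Code    = String
type Chunk   = (Comment, Code)

-- Assumption: only unindented "--" lines count as chunk-opening comments.
isTopComment :: String -> Bool
isTopComment = ("--" `isPrefixOf`)

isBlank :: String -> Bool
isBlank = all isSpace

-- Break source text into (comment, code) pairs: a run of top-level
-- comment lines, then everything up to the next top-level comment.
chunks :: String -> [Chunk]
chunks = go . dropWhile isBlank . lines
  where
    go [] = []
    go ls =
      let (cs, afterComment) = span isTopComment ls
          (code, rest)       = break isTopComment afterComment
      in (unlines cs, unlines (dropWhileEnd isBlank code)) : go rest
```

	Each resulting chunk could then be examined and converted
	into a Function, Class, Instance and so on.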

	   
	The good news is that the parser is able to deal with
	any positional placement of comments. For example,
	when it deals with functions it accepts comments in any
	or all of the following positions (marked "+"),
	concatenating whatever it finds:
 
	+ Many "--" comments or "{- .. -}" comment before the signature
	x Signature
	+ Many "--" comments or "{- .. -}" comment after the signature
	x First line of function body
	+	Many indented "--" comment lines
	x       Indented function body
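
	The concatenation of comments from all of these positions
	can be sketched as follows. This is a simplified
	illustration under stated assumptions, not the real
	extractor: it handles only "--" line comments, and the
	Documented type and the gather function are hypothetical
	names introduced for the example.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- Hypothetical result type: the gathered comment text and the
-- code lines that remain once the comments are taken out.
data Documented = Documented
  { docComment :: String
  , docCode    :: [String]
  } deriving (Eq, Show)

-- A "--" comment line at any indentation.
isLineComment :: String -> Bool
isLineComment l = "--" `isPrefixOf` dropWhile isSpace l

-- Drop indentation, the leading dashes and the space after them.
stripMarker :: String -> String
stripMarker = dropWhile isSpace . dropWhile (== '-') . dropWhile isSpace

-- Collect comments wherever they appear in the chunk (before the
-- signature, after it, or indented inside the body) and join them.
gather :: [String] -> Documented
gather ls = Documented
  { docComment = unwords (map stripMarker (filter isLineComment ls))
  , docCode    = filter (not . isLineComment) ls
  }
```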
 
	A similar pattern applies to other entities. But in order
	for this positional approach to work I had to introduce
	the concept of a category (known and cherished in Smalltalk,
	Objective C, Eiffel). In the Haskell case, a special banner
	separates groups of functions. If the banner is not marked
	as such somehow, it becomes part of the comment of the
	entity that follows it (wrong, but not catastrophic).
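
	To make the role of the banner concrete, here is a minimal
	sketch of grouping lines by category. The banner format
	"-- Category: <name>" is purely an assumption made for this
	example (no format has been fixed above), and bannerName
	and categorize are hypothetical helpers.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe, isJust)

-- Hypothetical banner format, chosen only for this illustration.
bannerPrefix :: String
bannerPrefix = "-- Category: "

-- The category name if the line is a banner, Nothing otherwise.
bannerName :: String -> Maybe String
bannerName = stripPrefix bannerPrefix

-- Group lines under the most recent banner. Lines that precede
-- the first banner are filed under the empty category name.
categorize :: [String] -> [(String, [String])]
categorize = go ""
  where
    go name ls =
      let (body, rest) = break (isJust . bannerName) ls
          section      = [(name, body) | not (null body) || not (null name)]
      in section ++ case rest of
           []       -> []
           (b : bs) -> go (fromMaybe "" (bannerName b)) bs
```

	Because the banner is recognized and consumed as a
	delimiter, it never leaks into the comment of the entity
	that follows it.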

	It seems, after all, that I was not entirely correct in
	one of my previous posts - an intelligent parser can
	cope with a purely positional layout, given a bit of help
	in the form of defined category delimiters.
	I should have remembered this, because I've done similar
	parsing for the Xcoral browser for Java.

	I thought this would be helpful information
	for our discussion. I'll post the code when it's ready.

	Jan

Jan Skibinski