[Haskell-cafe] Modelling structurally subtyped and cyclic data in type safe Haskell

8 Dec 2016

      Hello

This is my first post to Haskell Café. I am a hobbyist programmer who has lots
of experience with imperative and OO languages ranging from 80286 assembly to
Ada and Ruby. Last four years I've been learning Haskell, which has been a
painfully educational experience - I have a good understanding about how CPUs
work, but very little understanding about maths. This is a very long post, I apologize.

Now I'm trying to create a tool which automatically generates foreign function
interface (FFI) imports from C and C++ to Haskell. I have written a GCC plugin
which extracts all relevant (and lots of irrelevant) information from GCC's
TREEs and dumps the output as JSON. It also dumps preprocessor macros which is
one of the reasons this tol is being given birth.

The tool is used like Shake - import Development.Familiar and write code which
glues all relevant data processing and FFI generation functions together.

This is also going to be a reusable library which others can use for whatever
they find the data to be useful for. This could include making FFI bindings to
other languages. Some code analysis tools also might find the data useful.

I'm having a hard time importing this data to Haskell in a type safe way.
GCC's TREEs are structured in an OO way. I have tried to find a more
functional way of representing them, but so far with no luck.

The JSON data is typed, and the hierarchy is exactly the same as in GCC. Below
I construct a tree of the hierarchy where I list a few of possible types and
very few of the fields the data types contain. For listing fields and their
types I use a syntax similar to Haskell. This shortened listing should be
enough for illustration:

* macro: (data fields include: name :: string, tokens :: [token], etc.)
* tree: all nodes have a unique ID to be able to recover the cycles
    among the node graph; (id :: integer)
* * translation_unit: (declarations :: [top_level_declaration])
* * constant: a compile-time constant value; (type :: type)
* * * complex_constant: (value :: complex)
* * * integer_constant: (value :: integer)
* * namespace_decl: (name :: identifier,
                     declarations :: [top_level_declaration])
* * declaration: (name :: identifier, type :: type)
* * * const_decl: One component of enum type; (value :: type)
* * * function_decl: (type :: function_type, params :: [param_decl],
                      result :: result_decl)
* * * type_decl: (no additional fields)
* * type: (declaration :: type_decl, name :: identifier,
           size :: integer_constant, alignment :: integer,
           qualifiers :: [qualifier], completeness :: bool)
* * * array_type: (element_type :: type, element_count :: integer)
* * * numeric_type: (precision :: integer, signed :: bool,
                     min_value :: integer_constant,
                     max_value :: integer_constant)
* * * * enumeral_type: (values :: [const_decl])
* * * function_type: (param_types :: [type], result_type :: type)
* * * pointer_type: (referred_type :: type)
* * * record_type: struct (C or C++)/union;
        (base_types :: [(type, access_info)],
         fields :: [field_decl or const_decl or type_decl],
         methods :: [method_decl])
* reference: Any node may be replaced by a reference. They can be
    forward or backward references. (referred_id :: integer)

As you may see, tree nodes form many many cycles together. Subclasses
sometimes provide stronger invariants for their fields than the parent
classes. For example all declarations are associated with a type and for
function_decl nodes this type is always a function_type.

Now to my question. How to represent this in Haskell. Because of the enormous
amount of fields per tree node, I certainly need to use records.

-- Haskell follows:
data BaseTree = BaseTree { id :: Int, ... ... }
data BaseDecl = BaseDecl { baseTree :: BaseTree, name :: Identifier,
                           declType :: Type, ... }
data ConstDecl = ConstDecl { baseDecl :: BaseDecl, value :: ConstValue }
data ConstValue = ComplexConst Complex | IntegerConst Integer
data TypeDecl = TypeDecl { baseDecl :: BaseDecl }

But... What shall I do about the heterogenous lists which many nodes contain.
I wish I don't need to make myriards of ADTs for them, like:

data AnyTopLevelDecl = ATLDConst ConstDecl
                     | ATLDFunction FunctionDecl
                     | ATLDNamespace NamespaceDecl
                     | ATLDType TypeDecl -- ...
data StructFieldDecl = SFDConst ConstDecl
                     | SFDField FieldDecl
                     | SFDType TypeDecl

And so forth, basicly each heterogenous list gets its own ADT, because there
is often variance in which node types can appear in the lists.

I could use one big ADT like AnyDecl. However, to me it would feel like a bad
design, if my types allowed RecordType nodes to reference NamespaceDecls, and
therefore I don't want to pack all nodes into one type, unless I could use
some type parameters or constraints to restrict the possible values the ADT
could take.

At first I thought that the following could solve my problem:

data SomeDecl a = SomeDecl where
    SomeConst :: ConstDecl -> SomeDecl ConstDecl
    SomeField :: FieldDecl ->SomeDecl FieldDecl
    SomeFunction :: FunctionDecl -> SomeDecl FunctionDecl
    SomeType :: TypeDecl -> SomeDecl TypeDecl

-- and then I would use type classes like this
class TopLevelDecl decl where
    -- No methods needed
instance TopLevelDecl ConstDecl
instance TopLevelDecl FunctionDecl
instance TopLevelDecl TypeDecl

-- And then I could write functions like this:
handleTopLevelDecl :: TopLevelDecl d => SomeDecl d -> Whatever
handleTopLevelDecl (FunctionDecl f) = -- ...
-- and I'd list all cases which have a TopLevelDecl instance, but not
-- those which don't have. I also thought that my data types could look like
-- this:

data NamespaceDecl where
    NamespaceDecl :: {
        baseTree :: BaseTree,
        forall a. TopLevelDecl a => declarations :: [a]
        }

Obviously this does not work, because type classes are open, and even though
GHC prevents me from calling handleTopLevelDecl with (SOmeDecl FieldDecl), it
still complains if I don't define an equation for it, because in some other
module somebody might make a TopLevelDecl instance for FieldDecl.

Now I would like to know if there is a way to solve my problem in an elegant
way, e.g. by using type families.

My second question is how to resolve the cyclicity of the node graph. I don't
want to tie the knot, because I want to be able to manipulate the graph.

Maps are one option, yes:

data AnyNode = (list all possible nodes)
type Id = Int
type NodeIdMap = Map Id AnyNode

-- Then my data types refer to other nodes like this:
data FunctionDecl = FunctionDecl {
    parameters :: [Id]
    }

Now I lose all type safety. I would have to insert a run-time check to every
dereference of an Id if I want to make sure that function parameter
declarations don't contain StringConstants.

I've also considered using STRefs. During JSON parsing I would need to
populate the nodes with the node Ids they refer to and then have a separate
pass which turns all Ids to typed STRefs. At least with this approach I would
get all possible run-time errors straight after parsing, and not when the data
is used.

I'm again sorry for this long elaboration. However I think that if you
consider answering me you have a pretty good understanding of my problem and
also a good knowledge of my level of understanding Haskell. I would like to
unlearn OO-style thinking, and I would want to find a data and type
representation which feels Haskellish. However I don't want to cut down the
data. I want all information to be accessible in case somebody needs it.

Thank you in advance!

-- 
Aura

[Haskell-cafe] Modelling structurally subtyped and cyclic data in type safe Haskell

Aura Kelloniemi