I'm trying to write a compiler for a C-like language in Haskell. The compiler progresses by transforming an AST. The first pass parses the input to create the AST, tying the knot with the symbol table so that symbols can be located before they have been defined, without the need for forward references.
The AST contains information about types and expressions, and there can be connections between them; e.g. sizeof(T) is an expression that depends on a type T, and T[e] is an array type that depends on a constant expression e.
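As an aside, here's a minimal, self-contained sketch of that knot-tying trick (Decl, Sym, and buildTable are invented for illustration): the symbol table is passed, lazily, to the very code that builds it, so a declaration can refer to a symbol defined later.

import qualified Data.Map as Map
import Data.Map (Map)

data Decl = Decl String String      -- declared name, name of the type it refers to
data Sym  = SymInt | SymAlias Sym

buildTable :: [Decl] -> Map String Sym
buildTable decls = table
  where
    -- 'table' is used in its own definition; laziness makes this work
    -- as long as no symbol ultimately refers back to itself.
    table = Map.fromList [ (n, resolve ref) | Decl n ref <- decls ]
    resolve "int" = SymInt
    resolve r     = SymAlias (table Map.! r)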
Types and expressions are represented by Haskell data types like so:
data Type = TypeInt Id
          | TypePointer Id Type              -- target type
          | TypeArray Id Type Expr           -- elt type, elt count
          | TypeStruct Id [(String, Type)]   -- [(field name, field type)]
          | TypeOf Id Expr
          | TypeDef Id Type
data Expr = ExprInt Int                -- literal int
          | ExprVar Var                -- variable
          | ExprSizeof Type
          | ExprUnop Unop Expr
          | ExprBinop Binop Expr Expr
          | ExprField Bool Expr String -- Bool gives true=>s.field, false=>p->field
Here, Unop includes operators like address-of (&) and dereference (*), and Binop includes operators like plus (+) and times (*), etc.
Note that each type is assigned a unique Id. This is used to construct a type dependency graph in order to detect cycles, which lead to infinite types. Once we are sure that there are no cycles in the type graph, it is safe to apply recursive functions over types without getting into an infinite loop.
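For illustration, a minimal sketch of that cycle check (assuming Id has an Ord instance, and that we can list, for each Id, the Ids its type mentions directly), using Data.Graph from the containers package:

import Data.Graph (SCC (..), stronglyConnComp)

-- deps maps each type's Id to the Ids it refers to directly.
hasCycle :: [(Id, [Id])] -> Bool
hasCycle deps = any isCyclic (stronglyConnComp [ (i, i, es) | (i, es) <- deps ])
  where
    isCyclic (CyclicSCC _) = True
    isCyclic _             = False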
The next step is to determine the size of each type, assign offsets to struct fields, and replace ExprFields with pointer arithmetic. In doing so, we can determine the type of each expression, and eliminate ExprSizeofs, ExprFields, TypeDefs, and TypeOfs from the type graph, so our types and expressions have evolved, and now look more like this:
data Type' = TypeInt Id
           | TypePointer Id Type'
           | TypeArray Id Type' Int                -- constant expression has been eval'd
           | TypeStruct Id [(String, Int, Type')]  -- field offset has been determined

data Expr' = ExprInt Type' Int
           | ExprVar Type' Var
           | ExprUnop Type' Unop Expr'
           | ExprBinop Type' Binop Expr' Expr'
Note that we've eliminated some of the data constructors and slightly changed some of the others. In particular, Type' no longer contains any Expr', and every Expr' has had its Type' determined.
So, finally, the question: Is it better to create two almost identical sets of data types, or try to unify them into a single data type?
Keeping two separate data types makes it explicit that certain constructors can no longer appear. However, the function which performs constant folding to evaluate constant expressions will have the type:
foldConstants :: Expr -> Either String Expr'
But this means that we cannot perform constant folding later on with Expr' values (imagine some pass that manipulates an Expr' and wants to fold any constant expressions that emerge). We would need another implementation:
foldConstants' :: Expr' -> Either String Expr'
On the other hand, keeping a single type would solve the constant folding problem, but would prevent the type checker from enforcing static invariants.
Furthermore, what do we put into the unknown fields (like field offsets, array sizes, and expression types) during the first pass? We could plug the holes with undefined, or error "*hole*", but that feels like a disaster waiting to happen (like NULL pointers that you can't even check). We could change the unknown fields into Maybes, and plug the holes with Nothing (like NULL pointers that we can check), but it would be annoying in subsequent passes to have to keep pulling values out of Maybes that are always going to be Justs.
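To make that concrete, here's a sketch of the Maybe-holes variant (the extra field on TypeArray is hypothetical): every pass that runs after sizing must re-check a fact it already knows.

data Type = TypeInt Id
          | TypeArray Id Type Expr (Maybe Int)  -- size: Nothing until the sizing pass runs
          -- (other constructors as before)

arraySize :: Type -> Int
arraySize (TypeArray _ _ _ (Just n)) = n
arraySize (TypeArray _ _ _ Nothing)  = error "sizing pass hasn't run yet"
arraySize _                          = error "not an array"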
Hopefully someone with more experience will have a more polished, battle-tested and ready answer, but here's my shot at it.
You can have your pie and eat part of it too at relatively little cost with GADTs:
{-# LANGUAGE GADTs #-}

data P0 -- phase zero
data P1 -- phase one

data Type p where
  TypeInt     :: Id -> Type p
  TypePointer :: Id -> Type p -> Type p             -- target type
  TypeArray   :: Id -> Type p -> Expr p -> Type p   -- elt type, elt count
  TypeStruct  :: Id -> [(String, Type p)] -> Type p -- [(field name, field type)]
  TypeOf      :: Id -> Expr P0 -> Type P0
  TypeDef     :: Id -> Type P0 -> Type P0

data Expr p where
  ExprInt    :: Int -> Expr p -- literal int
  ExprVar    :: Var -> Expr p -- variable
  ExprSizeof :: Type P0 -> Expr P0
  ExprUnop   :: Unop -> Expr p -> Expr p
  ExprBinop  :: Binop -> Expr p -> Expr p -> Expr p
  ExprField  :: Bool -> Expr P0 -> String -> Expr P0 -- Bool gives true=>s.field, false=>p->field
Here the things we've changed are:
The datatypes now use GADT syntax. This means that constructors are declared using their type signatures: data Foo = Bar Int Char becomes data Foo where Bar :: Int -> Char -> Foo (aside from the syntax, the two are completely equivalent).
We have added a type variable to both Type and Expr. This is a so-called phantom type variable: there is no actual data stored that is of type p; it is used only to enforce invariants in the type system.
We've declared dummy types to represent the phases before and after the transformation: phase zero and phase one. (In a more elaborate system with multiple phases we could potentially use type-level numbers to denote them.)
GADTs let us store type-level invariants in the data structure. Here we have two of them. The first is that recursive positions must be of the same phase as the structure containing them. For example, looking at TypePointer :: Id -> Type p -> Type p, you pass a Type p to the TypePointer constructor and get a Type p as the result, and those ps must be the same type. (If we wanted to allow different types, we could use p and q.)
The second is that we enforce the fact that some constructors can only be used in the first phase. Most of the constructors are polymorphic in the phantom type variable p, but some of them require that it be P0. This means that those constructors can only be used to construct values of type Type P0 or Expr P0, not of any other phase.
GADTs work in two directions. The first is that if you have a function which returns a Type P1, and try to use one of the constructors which returns a Type P0 to construct it, you will get a type error. This is what's called "correct by construction": it's statically impossible to construct an invalid structure (provided you can encode all of the relevant invariants in the type system). The flip side of it is that if you have a value of Type P1, you can be sure that it was constructed correctly: the TypeOf and TypeDef constructors can't have been used (in fact, the compiler will complain if you try to pattern match on them), and any recursive positions must also be of phase P1. Essentially, when you construct a GADT you store evidence that the type constraints are satisfied, and when you pattern match on it you retrieve that evidence and can take advantage of it.
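For example, here's a minimal sketch (countLeaves is a name I'm making up) of a function over phase-one expressions: GHC accepts the match as exhaustive even though ExprSizeof and ExprField are never handled, because those constructors can only build Expr P0.

countLeaves :: Expr P1 -> Int
countLeaves (ExprInt _)       = 1
countLeaves (ExprVar _)       = 1
countLeaves (ExprUnop _ e)    = countLeaves e
countLeaves (ExprBinop _ l r) = countLeaves l + countLeaves r
-- No ExprSizeof or ExprField cases: both return Expr P0, which GHC
-- knows cannot unify with Expr P1, so those cases are unreachable.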
That was the easy part. Unfortunately, we have some differences between the two types beyond just which constructors are allowed: some of the constructor arguments are different between the phases, and some are only present in the post-transformation phase. We can again use GADTs to encode this, but it's not as low-cost and elegant. One solution would be to duplicate all of the constructors which are different, with one each for P0 and P1. But duplication isn't nice. We can try to do it in a more fine-grained way:
-- A couple of helper types. Here I take advantage of the fact that, of
-- the things only present in one phase, they're always present in P1
-- and not P0, never vice versa.
data MaybeP p a where
  NothingP :: MaybeP P0 a
  JustP    :: a -> MaybeP P1 a

data EitherP p a b where
  LeftP  :: a -> EitherP P0 a b
  RightP :: b -> EitherP P1 a b

data Type p where
  TypeInt     :: Id -> Type p
  TypePointer :: Id -> Type p -> Type p
  TypeArray   :: Id -> Type p -> EitherP p (Expr p) Int -> Type p
  TypeStruct  :: Id -> [(String, MaybeP p Int, Type p)] -> Type p
  TypeOf      :: Id -> Expr P0 -> Type P0
  TypeDef     :: Id -> Type P0 -> Type P0

-- for brevity
type MaybeType p = MaybeP p (Type p)

data Expr p where
  ExprInt    :: MaybeType p -> Int -> Expr p
  ExprVar    :: MaybeType p -> Var -> Expr p
  ExprSizeof :: Type P0 -> Expr P0
  ExprUnop   :: MaybeType p -> Unop -> Expr p -> Expr p
  ExprBinop  :: MaybeType p -> Binop -> Expr p -> Expr p -> Expr p
  ExprField  :: Bool -> Expr P0 -> String -> Expr P0
Here, with some helper types, we've enforced the fact that some constructor arguments can only be present in phase one (MaybeP) and some are different between the two phases (EitherP). While this makes us completely type-safe, it feels a bit ad hoc, and we still have to wrap things in and out of the MaybePs and EitherPs all the time. I don't know if there's a better solution in that respect. Complete type safety is something, though: we could write fromJustP :: MaybeP P1 a -> a and be sure that it is completely and utterly safe.
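Written out, that function is a single equation, and GHC accepts it as total:

fromJustP :: MaybeP P1 a -> a
fromJustP (JustP a) = a
-- NothingP builds only MaybeP P0 a, so GHC knows it can't occur here
-- and accepts this single equation as exhaustive.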
Update: An alternative is to use TypeFamilies:
data Proxy a = Proxy

class Phase p where
  type MaybeP p a
  type EitherP p a b
  maybeP  :: Proxy p -> MaybeP p a -> Maybe a
  eitherP :: Proxy p -> EitherP p a b -> Either a b
  phase   :: Proxy p
  phase = Proxy

instance Phase P0 where
  type MaybeP P0 a = ()
  type EitherP P0 a b = a
  maybeP  _ _ = Nothing
  eitherP _ a = Left a

instance Phase P1 where
  type MaybeP P1 a = a
  type EitherP P1 a b = b
  maybeP  _ a = Just a
  eitherP _ a = Right a
The only change to Expr and Type relative to the previous version is that the constructors need an added Phase p constraint, e.g. ExprInt :: Phase p => MaybeType p -> Int -> Expr p.
Here, if the type of p in a Type or Expr is known, you can statically know whether the MaybePs will be () or the given type, and which of the two types the EitherPs are, and can use them directly as that type without explicit unwrapping. When p is unknown, you can use maybeP and eitherP from the Phase class to find out what they are. (The Proxy arguments are necessary, because otherwise the compiler wouldn't have any way to tell which phase you meant.) This is analogous to the GADT version where, if p is known, you can be sure of what MaybeP and EitherP contain, while otherwise you have to pattern match on both possibilities. This solution isn't perfect either, in the respect that the 'missing' arguments become () rather than disappearing entirely.
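For example, at a known phase the type annotation can be read off directly with no unwrapping (exprType is a hypothetical accessor; it relies on MaybeType P1 reducing to Type P1):

exprType :: Expr P1 -> Type P1
exprType (ExprInt t _)       = t  -- t :: MaybeP P1 (Type P1), i.e. Type P1
exprType (ExprVar t _)       = t
exprType (ExprUnop t _ _)    = t
exprType (ExprBinop t _ _ _) = t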
Constructing Exprs and Types also seems to be broadly similar between the two versions: if the value you're constructing has anything phase-specific about it, then it must specify that phase in its type. The trouble seems to come when you want to write a function polymorphic in p that still handles phase-specific parts. With GADTs this is straightforward:
asdf :: MaybeP p a -> MaybeP p a
asdf NothingP = NothingP
asdf (JustP a) = JustP a
Note that if I had merely written asdf _ = NothingP, the compiler would have complained, because the type of the output wouldn't be guaranteed to be the same as that of the input. By pattern matching we can figure out what type the input was and return a result of the same type.
With the TypeFamilies version, though, this is a lot harder. Just using maybeP and the resulting Maybe, you can't prove anything to the compiler about the types. You can get part of the way there by, instead of having maybeP and eitherP return Maybe and Either, making them deconstructor functions like maybe and either which also make a type equality available:
maybeP :: Proxy p -> (p ~ P0 => r) -> (p ~ P1 => a -> r) -> MaybeP p a -> r
eitherP :: Proxy p -> (p ~ P0 => a -> r) -> (p ~ P1 => b -> r) -> EitherP p a b -> r
(Note that we need Rank2Types for this, and note also that these are essentially the CPS-transformed versions of the GADT versions of MaybeP and EitherP.)
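As a sketch, the corresponding instances could look like this (assuming RankNTypes is enabled so the class methods can carry those constrained arguments):

instance Phase P0 where
  type MaybeP P0 a = ()
  type EitherP P0 a b = a
  maybeP  _ z _ _ = z    -- here GHC knows p ~ P0, so the first continuation applies
  eitherP _ f _ a = f a

instance Phase P1 where
  type MaybeP P1 a = a
  type EitherP P1 a b = b
  maybeP  _ _ j a = j a  -- the MaybeP P1 a argument is just an a
  eitherP _ _ g b = g b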
Then we can write:
asdf :: Phase p => MaybeP p a -> MaybeP p a
asdf a = maybeP phase () id a
But that's still not enough, because GHC says:
data.hs:116:29:
    Could not deduce (MaybeP p a ~ MaybeP p0 a0)
    from the context (Phase p)
      bound by the type signature for
               asdf :: Phase p => MaybeP p a -> MaybeP p a
      at data.hs:116:1-29
    NB: `MaybeP' is a type function, and may not be injective
    In the fourth argument of `maybeP', namely `a'
    In the expression: maybeP phase () id a
    In an equation for `asdf': asdf a = maybeP phase () id a
Maybe you could solve this with a type signature somewhere, but at that point it seems like more bother than it's worth. So, pending further information from someone else, I'm going to recommend the GADT version as the simpler and more robust one, at the cost of a little syntactic noise.
Update again: The problem here was that because MaybeP p a is a type function and there is no other information to go by, GHC has no way to know what p and a should be. If I pass in a Proxy p and use that instead of phase, that pins down p, but a is still unknown.
There's no ideal solution to this problem, since each one has different pros and cons.
I would personally go with using a single data type "tree" and adding separate data constructors for things that need to be differentiated. I.e.:
data Type
  = ...
  | TypeArray Id Type Expr
  | TypeResolvedArray Id Type Int
  | ...
This has the advantage that you can run the same phase multiple times on the same tree, as you say, but the reasoning goes deeper than that. Let's say you are implementing a syntax element that generates more AST (something like include, or a C++-template kind of deal, which can depend on constant exprs like your TypeArray, so you can't evaluate it in the first iteration). With the unified data type approach, you can just insert the new AST into your existing tree, and not only can you run the same phases as before on that tree directly, but you will get caching for free: if the new AST references an array by using sizeof(typeof(myarr)) or something, you don't have to determine the constant size of myarr again, because its type is already a TypeResolvedArray from your previous resolution phase.
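A minimal sketch of such a re-runnable pass (evalConstExpr is a hypothetical constant evaluator of type Expr -> Either String Int): nodes resolved by an earlier run pass straight through, which is exactly where the free caching comes from.

resolveType :: Type -> Either String Type
resolveType (TypeArray i elt e) = do
  elt' <- resolveType elt
  n    <- evalConstExpr e
  return (TypeResolvedArray i elt' n)
resolveType (TypeResolvedArray i elt n) =
  return (TypeResolvedArray i elt n)   -- already resolved: no recomputation
resolveType t = return t               -- remaining constructors handled similarly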
You could use a different representation once you are past all of your compilation phases and it is time to interpret the code (or something); then you are certain that no more AST changes will be necessary, and a more streamlined representation might be a good idea.
By the way, you should use Data.Word.Word instead of Data.Int.Int for array sizes. It is such a common error in C to use ints for indexing arrays, when array sizes and indices are really unsigned (size_t). Please don't make this mistake in your language too, unless you really want to support arrays with negative sizes.