ANTLR4: Re-visiting parse rules after the whole AST is visited

Question

I am implementing generic functions for my own language and got stuck on the following problem:

Generic functions can get called from another source file (another parser instance). Let's assume we have a generic function in source file B and we call it from source file A, which imports source file B. When this happens, I need to type-check the body of the function (source file B) once again for every distinct manifestation of concrete types, derived from the function call (source file A). For that, I need to visit the body of the function in source file B potentially multiple times.

Source file B:

type T dyn;

public p printFormat<T>(T element) {
    printf("Test");
}

Source file A:

import "source-b" as b;

f<int> main() {
    b.printFormat<double>(1.123);
    b.printFormat<int>(543);
    b.printFormat<string[]>({"Hello", "World"});
}

I tried to implement this by moving the code that analyzes the function body and its children into an inner function, which I call every time I encounter a call to that particular function from anywhere (including from other source files). For some reason this does not work: I always get a segmentation fault. Maybe this is because the whole tree was already visited once?

For additional context: C++ source code of my visitor

Would appreciate some useful answers or tips, thank you! ;)

MSalters · Accepted Answer

I don't think the best approach is to hack around with parsers. Parsers should turn one array of characters into one AST.

In your case, you've got a fairly complex but new language that uses multiple files. When you import B, you really want to import its AST. C++ historically relied on a literal #include, with all the parsing problems that brings, and is only now getting modules. Languages like Java did away with textual inclusion, but retrofitted generics later on. You've got a clean slate. You should design your language such that the compiler can just take a bunch of ASTs as its input.

Since the compiler will take ASTs as input, each AST will be read-only. You can of course have a cache for instantiations so you don't need to re-instantiate printFormat<int> every time you encounter it in an AST, but that's a detail.
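
To illustrate the cache mentioned above, here is a minimal sketch, assuming instantiations are keyed by the function name plus its concrete type arguments. The names `CheckedFunction`, `InstantiationCache`, and `getOrCreate` are hypothetical and not part of ANTLR or the asker's compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the result of type-checking one instantiation.
struct CheckedFunction {
    std::string mangledName;
};

// Key: generic function name plus the concrete type arguments at the call site.
using InstantiationKey = std::pair<std::string, std::vector<std::string>>;

class InstantiationCache {
public:
    // Returns the cached instantiation, or builds it exactly once via `build`.
    template <typename Builder>
    std::shared_ptr<const CheckedFunction> getOrCreate(
            const std::string& fn,
            const std::vector<std::string>& typeArgs,
            Builder build) {
        InstantiationKey key{fn, typeArgs};
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // already instantiated

        std::shared_ptr<const CheckedFunction> result = build(fn, typeArgs);
        assert(result && "builder must return an instantiation");
        cache_.emplace(std::move(key), result);
        return result;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::map<InstantiationKey, std::shared_ptr<const CheckedFunction>> cache_;
};
```

With such a cache, a second call to printFormat<int> from any source file would reuse the first result, and only printFormat<double> would trigger another type-check of the body.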

What's not a detail is how instantiation should work in your language. A common mistake is to assume that C++ templates work like macros, at the text level. That's not the case; they work at the language level, and yours should too. It would be really convenient for you if instantiation took an AST (or at least a subtree thereof) and produced a new AST for the instantiation, again read-only. It's no coincidence that the C++ template meta-language is effectively a functional language. These kinds of problems become much easier the more you can make read-only.
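
As a rough illustration of "AST in, new AST out", the sketch below assumes a small hand-written immutable node type rather than ANTLR's parse tree classes. `Node`, `TypeBindings`, and `instantiate` are hypothetical names, and the node kinds are made up for the example.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical immutable AST node; this is not an ANTLR class.
struct Node {
    std::string kind;  // e.g. "FunctionDef", "Param", "TypeRef"
    std::string text;  // identifier or type name
    std::vector<std::shared_ptr<const Node>> children;
};
using NodePtr = std::shared_ptr<const Node>;

// Maps generic type names to concrete ones, e.g. T -> double.
using TypeBindings = std::map<std::string, std::string>;

// Builds a fresh tree with generic type references replaced.
// The input tree is only read, never modified.
NodePtr instantiate(const NodePtr& node, const TypeBindings& bindings) {
    assert(node && "cannot instantiate a null subtree");
    auto copy = std::make_shared<Node>();
    copy->kind = node->kind;
    copy->text = node->text;
    if (node->kind == "TypeRef") {
        auto it = bindings.find(node->text);
        if (it != bindings.end()) copy->text = it->second;  // substitute T
    }
    for (const auto& child : node->children)
        copy->children.push_back(instantiate(child, bindings));
    return copy;
}
```

Because the original subtree stays untouched, the same generic function can be instantiated any number of times, from any importing file, without one instantiation disturbing another.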

v.oddou · Answer

You may be interested in this blog post: https://devblogs.microsoft.com/cppblog/two-phase-name-lookup-support-comes-to-msvc/

What you're trying to do requires you to make a decision about your language. Do you want Java/C#-style generics?
In that case, you will apply type erasure, which means you don't have to track all the different type instantiations across your program. You validate the body once against a limited interface, and the use points encountered later (call sites) are guaranteed to generate valid instantiations.
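
The "validate once against a limited interface" idea can be sketched as two separate checks. This is only an illustration under the assumption that a type's capabilities can be modelled as a set of operation names; `Interface`, `GenericFunction`, `checkBodyOnce`, and `checkCallSite` are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>

// A set of operation names; assumed model of what a type supports.
using Interface = std::set<std::string>;

struct GenericFunction {
    Interface bound;       // operations T is guaranteed to support
    Interface usedInBody;  // operations the body actually performs on T
    bool bodyChecked = false;
};

// Done once per generic function: the body may only use what the bound offers.
bool checkBodyOnce(GenericFunction& fn) {
    for (const auto& op : fn.usedInBody)
        if (fn.bound.count(op) == 0) return false;
    fn.bodyChecked = true;
    return true;
}

// Done per call site: the concrete type must satisfy the bound.
// The body is not visited again.
bool checkCallSite(const GenericFunction& fn, const Interface& concreteTypeOps) {
    assert(fn.bodyChecked && "validate the body before checking call sites");
    for (const auto& op : fn.bound)
        if (concreteTypeOps.count(op) == 0) return false;
    return true;
}
```

Under this model, a call from source file A never needs to re-visit the function body in source file B; it only compares the concrete type against the bound.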

It seems you are using modules (as opposed to headers), so you're faced with the classic export problem. I would suggest serializing your generic symbols in some binary form of the compiler's memory model, so that "import source-b" means deserializing that structure. It would store the "printFormat" symbol in your internal representation (data model), allowing you to schedule a concrete instantiation/code generation later.

If you choose the template philosophy, then you can't do that, because you cannot run any sort of semantic pass on the body of your template symbols. This is the two-phase compilation paradigm. You'll have to serialize the AST itself, either by finding a way to reuse the ANTLR nodes and attaching a serializer that can dump and reconstruct them, or by replicating the AST with your own node classes and using boost::serialization (or some macro such as https://stackoverflow.com/a/43207178/893406).

Then, each call site will have to invoke a concretization (template instantiation), where you insert a new symbol, with some type mangling to make it unique, at the point in the AST where the original generic symbol is first declared. (The AST is immutable during visitation, so be careful to insert it into a COPY of the AST.) Then set some global bool flag to remember that you need to run a brand new parse and semantic validation all over again, on the whole file, at the end.

So when your first visitor is finished, you check that flag and re-run the whole thing on the modified tree (the copied AST).
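
The mangling and the re-run flag could look roughly like this. It is a sketch under assumptions: the mangling scheme is made up, the flag lives in a tracker object instead of a global, and `mangle`, `InstantiationTracker`, and `runUntilStable` are hypothetical names. The `pass` callback stands in for a full visitor run.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Made-up mangling scheme: the function name followed by each type argument.
std::string mangle(const std::string& fn, const std::vector<std::string>& typeArgs) {
    assert(!fn.empty() && "function name must not be empty");
    std::string out = fn;
    for (const auto& t : typeArgs) out += "__" + t;
    return out;
}

struct InstantiationTracker {
    std::set<std::string> known;  // mangled names inserted so far
    bool needsRerun = false;      // set when a new instantiation appears

    // Called at each call site; returns the unique symbol name to call.
    std::string request(const std::string& fn,
                        const std::vector<std::string>& typeArgs) {
        std::string name = mangle(fn, typeArgs);
        if (known.insert(name).second) needsRerun = true;  // first time seen
        return name;
    }
};

// Re-runs the pass until a run requests no new instantiation.
// Returns the number of runs performed.
template <typename Pass>
int runUntilStable(InstantiationTracker& tracker, Pass pass) {
    int runs = 0;
    do {
        tracker.needsRerun = false;
        pass(tracker);
        ++runs;
    } while (tracker.needsRerun);
    return runs;
}
```

A loop rather than a single re-run is used here because a freshly instantiated body may itself call other generic functions, which would request further instantiations.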

These two choices are the difference between a universal type and an existential type.

Tags: c++, parsing, antlr, antlr4, abstract-syntax-tree

ChilliBits
