Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using ANTLR to analyze and modify source code; am I doing it wrong?

I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:

foo(['a', 'b', 'c']);

Extract 'a', 'b', and 'c' and rewrite the code as:

foo('bar', [0, 1, 2]);

I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.

The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.

My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:

a.b.foo();

To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:

Program
    CallExpression
        Identifier('foo')
        ArgumentList
            ArrayLiteral
                StringLiteral('a')
                StringLiteral('b')
                StringLiteral('c')

This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.

I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?

like image 563
Jacob Avatar asked Jul 23 '12 07:07

Jacob


People also ask

What is ANTLR used for?

What is ANTLR? ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. Terence Parr is a tech lead at Google and until 2022 was a professor of data science / computer science at Univ.

How do you use ANTLR in Python?

What you need to do to get a parse tree: define a lexer and parser grammar. invoke ANTLR: it will generate a lexer and a parser in your target language (e.g., Java, Python, C#, JavaScript) use the generated lexer and parser: you invoke them passing the code to recognize and they return to you a parse tree.

What is ANTLR lexer?

A lexer is recognizer that draws input symbols from a character stream. lexer grammars result in a subclass of this object. A Lexer object uses simplified match() and error recovery mechanisms in the interest of speed.


2 Answers

What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming is parser is all you need, and discovering that it isn't and that you have to fill in the gap.

Tools that do that this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This harder to do than it looks; see my SO answer on compiling an AST back to source code.

The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling that "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems include typically source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.

For instance, if you had source-to-source transformation capabilities (of our tool, the DMS Software Reengineering Toolkit, you'd be able to write parts of your example code changes using these DMS transforms:

       domain ECMAScript.

       tag replace;  -- says this is a special kind of temporary tree


       rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
           expression->expression
        =   " \function_name ( '[' \list ']' ) "
        -> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";


       rule replace_unit_list(s:character_literal):
           expression_list -> expression_list
           replace(s) -> compute_index_for(s);

       rule replace_long_list(s:character_list, list:expression_list):
           expression_list -> expression_list
           "\replace\(\s\,\list)->  "compute_index_for\(\s\),\list";

with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo" [I'm guessing you want to do this), and "compute_index_for" which given a string literals, knows what integer to replace it with.

Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and an right hand side acting as replacement, both usually quoted in metaquotes " which seperates rewrite-rule language text from target-language (e.g. JavaScript) text. There's lots of meta-escapes ** found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note the its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represent future intent to do more transformations.

This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).

Your specific example may vary somewhat since you really weren't precise about the details.

Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).

You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.

Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.

like image 64
Ira Baxter Avatar answered Sep 28 '22 01:09

Ira Baxter


So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:

var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;

var dependencies = new List<IModule>();

var functionCall = (
    from callExpression in program.Children.OfType<CallExpression>()
    where callExpression.Children[0].Text == "foo"
    select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;

tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
    tokenStream.Replace(
        (array.Children[i] as StringLiteral).Token.TokenIndex, 
        i.ToString());
}

var rewrittenCode = tokenStream.ToString();
like image 44
Jacob Avatar answered Sep 28 '22 01:09

Jacob