I'm trying to do some basic code diff with the Roslyn API, and I'm running into some unexpected problems. Essentially, I have two pieces of code that are the same, except one line has been added. This should just return the line of the changed text, but for some reason, it's telling me that everything has changed. I have also tried just editing one line instead of adding a line, but I get the same result. I would like to be able to apply this to two versions of a source file to identify differences between the two. Here's the code I'm currently using:
SyntaxTree tree = SyntaxTree.ParseCompilationUnit(
@"using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace HelloWorld
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine(""Hello, World!"");
}
}
}");
var root = (CompilationUnitSyntax)tree.Root;
var compilation = Compilation.Create("HelloWorld")
.AddReferences(
new AssemblyFileReference(
typeof(object).Assembly.Location))
.AddSyntaxTrees(tree);
var model = compilation.GetSemanticModel(tree);
var nameInfo = model.GetSemanticInfo(root.Usings[0].Name);
var systemSymbol = (NamespaceSymbol)nameInfo.Symbol;
SyntaxTree tree2 = SyntaxTree.ParseCompilationUnit(
@"using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace HelloWorld
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine(""Hello, World!"");
Console.WriteLine(""jjfjjf"");
}
}
}");
var root2 = (CompilationUnitSyntax)tree2.Root;
var compilation2 = Compilation.Create("HelloWorld")
.AddReferences(
new AssemblyFileReference(
typeof(object).Assembly.Location))
.AddSyntaxTrees(tree2);
var model2 = compilation2.GetSemanticModel(tree2);
var nameInfo2 = model2.GetSemanticInfo(root2.Usings[0].Name);
var systemSymbol2 = (NamespaceSymbol)nameInfo2.Symbol;
foreach (TextSpan t in tree2.GetChangedSpans(tree))
{
Console.WriteLine(tree2.Text.GetText(t));
}
And here's the output I'm getting:
System
using System
Collections
Generic
using System
Linq
using System
Text
namespace HelloWorld
{
class Program
{
static
Main
args
{
Console
WriteLine
"Hello, World!"
Console.WriteLine("jjfjjf");
}
}
}
Press any key to continue . . .
Interestingly, it seems to show each line as tokens for every line except for the added line, where it displays the line without breaking it up. Does anyone know how to isolate the actual changes?
Bruce Boughton's guess is correct. The GetChangedSpans method is not intended to be a general-purpose syntax diffing mechanism to take the difference between two syntax trees that have no shared history. Rather, it is intended to take two trees that have been produced by edits to a common tree, and determine which portions of the trees are different because of edits.
If you had taken your first parse tree and inserted the new statement into it as an edit, then you would see a far smaller set of changes.
It might help if I briefly describe how the Roslyn lexer and parser work, at a high level.
The basic idea is that lexer-produced "syntax tokens" and parser-produced "syntax trees" are immutable. They never change. Because they never change, we can re-use parts of previous parse trees in new parse trees. (Data structures which have this property are often called "persistent" data structures.)
Because we can re-use existing parts, we can, for example, use the same value for every instance of a given token, say class
, that appears in the program. The length and content of every class
token is exactly the same; the only things that distinguish two different class
tokens are their trivia, (what spacing and comments surround them) and their position, and their parent -- what larger syntax node contains the token.
When you parse a block of text we generate syntax tokens and syntax trees in a peristent, immutable form, which we call the "green" form. We then wrap up the green nodes in a "red" layer. The green layer knows nothing about position, parents, and so on. The red layer does. (The whimsical names are due to the fact that when we first drew this data structure on a whiteboard, those are the colours that we used.) When you create an edit to a given syntax tree, we look at the previous syntax tree, identify the nodes which changed, and then build new nodes only on the spine of the changes. All the other branches of the green tree stay the same.
When diffing two trees, basically what we do is take the set difference of the green nodes. If one of the trees was produced by editing the other, then almost all the green nodes will be the same because only the spine was rebuilt. The tree diffing algorithm will identify the changed nodes and work out the affected spans.
If the two trees have no history in common then the only green nodes they'll have in common are the individual tokens, which, as I said before, are re-used everywhere. Every higher-level green syntax node will be a different green node, and therefore be treated as different by the tree difference engine, even if its text is the same.
The purpose of this method is to allow the editor code to rapidly make a conservative guess about what portions of a text buffer need to be, say, recolourized, after an edit, or an undo, or some such thing. The assumption is that the trees have a historical relationship. The intention is not to provide a general-purpose textual difference mechanism; there are plenty of great tools for that already.
Imagine, for example, that you had pasted your first program into the editor, then highlighted the whole thing, then pasted the second program into the editor. One would reasonably expect that the editor would not waste time trying to figure out what portions of the pasted-down code happened to be identical with the previously-pasted code. That could be very expensive and the answer is likely to be "not much". Rather, the editor makes the conservative assumption that the entire pasted-over region is brand-new and entirely different code. It doesn't spend any time trying to make correspondences between the old code and the new code; it reparses and therefore recolourizes the whole thing.
If, on the other hand you had just pasted in the single different statement, then the editing engine would simply insert the edit into the right place. The parse tree would be regenerated re-using the existing green nodes where possible, and the difference engine would identify what spans need to be re-colourized: the ones with different green nodes.
Does that all make sense?
Ha, apparently Kevin and I were both typing out the same answer at the same time, in adjoining offices. A bit of duplicated effort, but I think both answers have good perspectives on the situation. :-)
@bruceboughton is right, GetChangedSpans
is intended to discover changes made by the incremental parser. With code like the following, I get much better output:
var code =
@"using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace HelloWorld
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine(""Hello, World!"");
}
}
}";
var text = new StringText(code);
SyntaxTree tree = SyntaxTree.ParseCompilationUnit(text);
var index = code.IndexOf("}");
var added = @" Console.WriteLine(""jjfjjf"");
";
var code2 = code.Substring(0, index) +
added +
code.Substring(index);
var text2 = new StringText(code2);
var tree2 = tree.WithChange(text2, new [] { new TextChangeRange(new TextSpan(index, 0), added.Length) } );
foreach (var span in tree2.GetChangedSpans(tree))
{
Console.WriteLine(text2.GetText(span));
}
However, in general GetChangedSpans is meant to be a reasonaby fast but conservative diff. For more control over the diff, and more accurate results, you probably want to implement your own tree-diffing algorithm that you can tune to meet your needs.
In the code above, if you are using VS, the editor has built in change reporting and text diffing that will allow you to easily construct the TextChangeRange
objects, but otherwise you'll probably still need at least a text diffing algorithm if you want to be able to pass changes to the incremental parser.
I would guess that GetChangedSpans
is intended to compare changes between a tree and trees created from changes to the original tree, rather than between two arbitrary trees.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With