 

How does the release version of Roslyn implement immutable trees?

I understand that the pre-release version of Roslyn implemented immutable trees as described in this excellent blog post by Eric Lippert. However, that post ends with:

The cost is that this system is complex and can consume a lot of memory if the "red" facades get large. We are at present doing experiments to see if we can reduce some of the costs without losing the benefits.

I'd like to ask what the outcome was for the release version. I've started examining the Roslyn sources but the code is fairly complex to follow.

What I'm interested in are the low-level design results regarding the costs mentioned above.

  1. What is the cost of a single edit in terms of red/green nodes?
  2. What optimizations have been made for other operations (e.g. delete, undo)?
bright asked Sep 21 '14 20:09


2 Answers

There is a fantastic look at Roslyn's performance and its implementation of immutable trees by VSadov on the CodePlex discussion forums: https://roslyn.codeplex.com/discussions/541953

At a high level, he discusses concurrency, de-duplication of green nodes, de-duplication of terminals, and de-duplication of strings within these trees.

He also talks about the laziness of red trees. Instead of computing the red tree after every text change, the red tree is computed only when someone requests it.
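To make the laziness concrete, here is a minimal sketch (in Python, not Roslyn's actual C# code; all names are illustrative) of the red/green split: green nodes are position-independent and immutable, while the red facade carries parent pointers and absolute positions and is only materialized when a child is actually requested.

```python
# Illustrative sketch of lazy red nodes over an immutable green tree.
# Not Roslyn's real implementation; names and structure are assumptions.

class GreenNode:
    """Position-independent node: knows only its kind, width, and children."""
    def __init__(self, kind, width, children=()):
        self.kind = kind
        self.width = width
        self.children = tuple(children)

class RedNode:
    """Facade adding a parent pointer and absolute position, built lazily."""
    def __init__(self, green, parent=None, position=0):
        self.green = green
        self.parent = parent
        self.position = position
        self._red_children = [None] * len(green.children)

    def child(self, i):
        # Create the red wrapper only when someone actually asks for it.
        if self._red_children[i] is None:
            offset = sum(c.width for c in self.green.children[:i])
            self._red_children[i] = RedNode(self.green.children[i],
                                           parent=self,
                                           position=self.position + offset)
        return self._red_children[i]
```

Until `child(i)` is called, no red node exists for that subtree, which is why a text change costs nothing on the red side by itself.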

Finally, he discusses the red tree and the fact that it's weak. I'd never used or seen the WeakReference class, and VSadov provides a good overview of how it's used within the red tree. Basically, the garbage collector is allowed to clean up portions of the red tree and they can be recreated later if needed. I'm not familiar with the implementation, but Eric Lippert notes that the red tree facade can result in a large memory footprint. I imagine these WeakReferences help mitigate this problem to some degree.
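The weak-reference idea can be sketched in a few lines (again a conceptual Python sketch with invented names, not Roslyn's code): a cache holds red nodes only weakly, so the garbage collector may reclaim them under memory pressure, and the cache simply rebuilds a node on the next request.

```python
# Sketch of weakly-held, recreatable nodes. Illustrative only.
import weakref

class FacadeNode:
    """Stand-in for an expensive-to-keep red node."""
    def __init__(self, key):
        self.key = key

class WeakNodeCache:
    def __init__(self):
        self._cache = {}

    def get(self, key):
        ref = self._cache.get(key)
        node = ref() if ref is not None else None
        if node is None:
            node = FacadeNode(key)              # recreate on demand
            self._cache[key] = weakref.ref(node)  # hold it only weakly
        return node
```

While someone holds a strong reference, repeated lookups return the same instance; once all strong references are gone, the GC can reclaim it and a later lookup transparently rebuilds an equivalent node.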

JoshVarty answered Oct 04 '22 08:10


At present we are still doing basically the same thing that Eric described. We tried some experiments, but found that the cost to API clarity was too high to pay. Instead, the main change we made was to reduce heap allocations and GC costs by making SyntaxToken a struct in the red model. This is a significant savings, since in average source files roughly half of the nodes in the syntax tree are terminals (SyntaxTokens). The green nodes are not structs, but on the other hand, since they do not contain parent pointers or positions, they are interned - we share the same instance of the green node for all identical nodes (same trivia and text).
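The interning described above can be sketched as a simple cache keyed by everything that makes two tokens identical (a Python sketch with assumed names, not Roslyn's implementation):

```python
# Sketch of green-token interning: tokens with identical kind, text, and
# trivia share one immutable instance. Illustrative only.

class GreenToken:
    __slots__ = ("kind", "text", "leading", "trailing")

    def __init__(self, kind, text, leading="", trailing=""):
        self.kind = kind
        self.text = text
        self.leading = leading    # leading trivia (whitespace, comments)
        self.trailing = trailing  # trailing trivia

_token_cache = {}

def make_token(kind, text, leading="", trailing=""):
    key = (kind, text, leading, trailing)
    tok = _token_cache.get(key)
    if tok is None:
        tok = GreenToken(kind, text, leading, trailing)
        _token_cache[key] = tok
    return tok
```

Because green nodes have no parent pointers or positions, sharing one instance across every occurrence is safe; this is what makes the de-duplication possible at all.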

I'm not sure what you mean by "cost" of an edit - time, space, complexity, etc. In general we have an incremental lexer which rescans the area affected by the edit. Then we have an incremental parser which, in the best case, reparses only the statements that intersect with the newly lexed tokens and then rebuilds the "spine" of the green tree back up to the root, re-using the rest of the green nodes in the tree. None of the red nodes are reusable, by definition. There is a new root node for the new tree, and it is reachable from any other node.
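The spine-rebuilding step can be sketched on any immutable tree (a generic Python sketch, not Roslyn's parser): replacing one leaf creates new nodes only along the path from that leaf to the root, while every untouched subtree is shared by identity with the old tree.

```python
# Sketch of rebuilding the "spine" of an immutable tree after an edit.
# Only ancestors of the changed node are re-created; siblings are shared.

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = tuple(children)

def replace_at(root, path, new_subtree):
    """Return a new root with the node at `path` replaced.

    `path` is a sequence of child indices from the root down.
    """
    if not path:
        return new_subtree
    i = path[0]
    new_child = replace_at(root.children[i], path[1:], new_subtree)
    children = root.children[:i] + (new_child,) + root.children[i + 1:]
    return Node(root.value, children)  # fresh node on the spine only
```

For a tree of depth d, an edit allocates O(d) new nodes regardless of the tree's total size; everything off the spine is reused, which is exactly why green-node reuse makes incremental reparsing cheap.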

No other optimizations have been done in the compiler itself. I believe the IDE re-uses a cached tree after an undo, but I'm not sure.

Kevin Pilch answered Oct 04 '22 08:10