Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Syntax Parsing in Text Editors

How do text editors perform syntax highlighting? I know that vim uses simple regular expressions with special extensions to make them more powerful for differentiating syntax elements, but I also know that some other text editors like TextMate allow you to define full parsers. TextMate isn't known to perform well on large files, but Sublime Text supposedly performs better than vim on large files, and yet supports legacy TextMate syntax parsers. Are there some interesting hacks that it employs to avoid performing a top-to-bottom parse of the file or does it just employ a very efficient parsing algorithm?

like image 279
adrusi Avatar asked Dec 21 '22 14:12

adrusi


1 Answers

I wrote a text editor once. I thought I could do better than others. Then I learnt Vim and realized I was wrong :P Parts of my highlighting engine still exist on GitHub.

Several approaches are possible. You could write real lexical analysis (or shallow parsing) routines, but regular expressions may actually be faster if you use them effectively and you're not an expert in source parsing theory. I used a mix of the two.

To get good performance, editors are extremely unlikely to be highlighting the entire file. Instead, just highlight the visible region of the file, so you're minimizing the work done. Of course, then you have to consider what happens as the user starts editing somewhere in the middle of that visible region. My approach was to keep a snapshot of the lexer state (i.e. placement of all tokens and lexical states) in memory all the time, then from the cursor, walk backwards one or two tokens, use the lexer state at that point (i.e. keep the tokens and state stack to the left, and discard the ones to the right) and restart the highlighter from that point until the end of the visible range. Because all (I think) source languages are read from left-to-right, highlighting of tokens further to the left of the edited region should never change.

EDIT | Just re-reading my source, there were some other optimizations I made along the way. Long lists of keywords (e.g. built-in function names) are expensive to check. I built them into a radix tree which had a huge performance gain.

like image 159
d11wtq Avatar answered Jan 18 '23 03:01

d11wtq