
Writing a syntax highlighter

I was hoping to write my own syntax highlighter as a summer project, but I am not sure how to go about it.

I know that there are a bunch of implementations out there, but I would like to learn about regular expressions and how syntax highlighting works.

How does syntax highlighting work and what are some good references for developing one? Does the syntax highlighter scan each character as it is typed or does it scan the document/text area as a whole after each character is typed?

Any insight would be greatly appreciated.

Thanks.

PS: I was planning on writing it in ActionScript

Ian Dallas asked Apr 30 '09 22:04


2 Answers

Syntax highlighters can work in two very general ways. The first implements a full lexer and parser for the language(s) being highlighted, exactly identifying each token's type (keyword, class name, instance name, variable type, preprocessor directive...). This provides all the information needed to exactly highlight the code according to some specification (keywords in red, class names in blue, what have you).
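
To make the first approach concrete, here is a minimal sketch of the lexer half only (no parser), for a tiny made-up Java-like subset. The rule table, the `tok-` class names and the function names are assumptions chosen for illustration, not taken from any particular highlighter.

```javascript
// Lexer rules for a tiny, hypothetical Java-like subset. Order matters:
// comments and strings come first so keywords inside them are not matched.
// The sticky flag (y) makes each pattern match only at lastIndex.
const TOKEN_RULES = [
  ["comment", /\/\/[^\n]*|\/\*[\s\S]*?\*\//y],
  ["string", /"(?:\\.|[^"\\\n])*"/y],
  ["number", /\d+(?:\.\d+)?/y],
  ["word", /[A-Za-z_]\w*/y],
  ["space", /\s+/y],
];
const KEYWORDS = new Set(["public", "class", "private", "static", "return", "void"]);

// Split the source into { type, text } tokens, left to right.
function tokenize(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    let matched = false;
    for (const [type, pattern] of TOKEN_RULES) {
      pattern.lastIndex = pos;
      const m = pattern.exec(source);
      if (m) {
        const text = m[0];
        // A word is either a keyword or a plain identifier.
        const kind = type === "word" ? (KEYWORDS.has(text) ? "keyword" : "identifier") : type;
        tokens.push({ type: kind, text: text });
        pos += text.length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Anything the rules do not cover is treated as one punctuation character.
      tokens.push({ type: "punctuation", text: source[pos] });
      pos += 1;
    }
  }
  return tokens;
}

// Escape characters that are special in HTML text content.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Wrap keywords, strings, numbers and comments in styled spans.
function highlight(source) {
  return tokenize(source).map(function (token) {
    const safe = escapeHtml(token.text);
    const plain = token.type === "space" || token.type === "punctuation" || token.type === "identifier";
    return plain ? safe : '<span class="tok-' + token.type + '">' + safe + "</span>";
  }).join("");
}

// highlight("public class A")
// → '<span class="tok-keyword">public</span> <span class="tok-keyword">class</span> A'
```

Because every token carries a type, the styling decision is a simple lookup. Telling class names from instance names, as described above, would need a real parser on top of this token stream.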

The second way is something like the one Google Code Prettify employs, where instead of implementing one lexer/parser per language, a couple of very general parsers are used that can do a decent job on most syntaxes. This highlighter, for example, will be able to parse and highlight reasonably well any C-like language, because its lexer/parser can identify the general components of those kinds of languages.

This also has the advantage that you don't need to explicitly specify the language, as the engine will determine by itself which of its generic parsers can do the best job. The downside, of course, is that the highlighting is less accurate than with a language-specific parser.
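
As a rough illustration of the generic idea, the sketch below uses two hypothetical rule sets that differ only in comment syntax and picks one with a simple counting heuristic. This is not how Google Code Prettify is implemented; the style names, the heuristic and the `com`/`str`/`num` class names are all assumptions made for the example.

```javascript
// Two hypothetical generic comment styles; everything else is shared.
const GENERIC_STYLES = {
  "c-like": /\/\/[^\n]*|\/\*[\s\S]*?\*\//g,
  "hash": /#[^\n]*/g,
};
// Single- or double-quoted string literals with backslash escapes.
const STRING = /"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'/g;

// Guess the style: blank out string literals, then count comment matches
// for each style and keep the highest count (ties go to "c-like").
function guessStyle(source) {
  const withoutStrings = source.replace(STRING, '""');
  let best = "c-like";
  let bestCount = -1;
  for (const [name, commentPattern] of Object.entries(GENERIC_STYLES)) {
    const count = (withoutStrings.match(commentPattern) || []).length;
    if (count > bestCount) {
      best = name;
      bestCount = count;
    }
  }
  return best;
}

// Escape characters that are special in HTML text content.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Highlight comments, strings and numbers without any keyword list.
function highlightGeneric(source) {
  const comment = GENERIC_STYLES[guessStyle(source)].source;
  const pattern = new RegExp(
    "(" + comment + ")|(" + STRING.source + ")|(\\b\\d+(?:\\.\\d+)?\\b)", "g");
  let out = "";
  let last = 0;
  for (const m of source.matchAll(pattern)) {
    // Copy the unstyled text between the previous match and this one.
    out += escapeHtml(source.slice(last, m.index));
    const cls = m[1] !== undefined ? "com" : m[2] !== undefined ? "str" : "num";
    out += '<span class="' + cls + '">' + escapeHtml(m[0]) + "</span>";
    last = m.index + m[0].length;
  }
  return out + escapeHtml(source.slice(last));
}

// highlightGeneric("x = 1 # one")
// → 'x = <span class="num">1</span> <span class="com"># one</span>'
```

The trade-off mentioned above shows up directly: nothing here knows what a keyword or a class name is, so only the constructs shared by many languages get styled.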

David Anderson answered Oct 05 '22 04:10


Building a syntax highlighter is all about finding specific keywords in the code and giving them a specific style (font, font style, colour etc.). To achieve this, you will need to define a list of keywords specific to the programming language in which the code is written, then scan the text (e.g. using regular expressions), find the matching tokens and wrap them in appropriately styled HTML tags.

A very basic highlighter written in JavaScript would look like this:

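    // Keywords to highlight (a small Java-like subset).
    var keywords = ["public", "class", "private", "static", "return", "void"];

    // `code` is assumed to already hold the HTML-escaped source text.
    for (var i = 0; i < keywords.length; i++) {
        // \b matches whole words only, including at the very start or end of
        // the text. The lookahead skips matches inside a tag (such as the
        // class attribute of a span inserted earlier) or inside an element.
        var regex = new RegExp("\\b(" + keywords[i] + ")\\b(?![^<]*>|[^<>]*</)", "g");
        code = code.replace(regex, "<span class='rm-code-keyword'>$1</span>");
    }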

raimme answered Oct 05 '22 06:10