
Python pygments lexer state preservation

Running the default Pygments lexer on the C++ text class foo{}; produces these tokens:

(Token.Keyword, 'class')
(Token.Text, ' ')
(Token.Name.Class, 'foo')
(Token.Punctuation, '{')
(Token.Punctuation, '}')
(Token.Punctuation, ';')

Note that the token foo has the type Token.Name.Class.
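For reference, a minimal sketch of how a token list like the one above can be produced. It assumes the lexer in question is pygments.lexers.CppLexer (the question only says "default lexer") and filters out whitespace tokens, whose exact type differs between Pygments versions.

```python
from pygments.lexers import CppLexer
from pygments.token import Token

code = "class foo{};"
lexer = CppLexer()  # assumption: the C++ lexer is the one meant by "default lexer"

# get_tokens yields (token type, value) pairs; drop whitespace-only values.
tokens = [(ttype, value) for ttype, value in lexer.get_tokens(code) if value.strip()]

for ttype, value in tokens:
    print((ttype, value))
```

The class name is reported as Token.Name.Class only because the lexer has just seen the class keyword, which is the state the question wants to preserve.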

If I change the class name to foobar, I want to be able to run the default lexer only on the touched tokens, in this case the original tokens foo and {.

Q: How can I save the lexer state so that tokenizing foobar{ on its own still gives foobar the type Token.Name.Class?

This feature would speed up syntax highlighting for large source files that change right in the middle of the file, for example while the user is typing. There seems to be no documented way of doing this, and no information on how to do it with the default Pygments lexers.

Are there any other syntax highlighting systems that support this behavior?

EDIT:

Regarding performance here is an example: http://tpcg.io/ESYjiF

Raxvan asked Jun 20 '18 08:06



1 Answer

From my understanding of the source code, what you want is not possible.

I won't dig in and explain every relevant line of code, but basically, here is what happens:

  • Your lexer class is pygments.lexers.c_cpp.CLexer, which inherits from pygments.lexer.RegexLexer.
  • The pygments.lex(code, lexer) function does nothing more than call the get_tokens method on the lexer and handle errors.
  • lexer.get_tokens basically converts the source code to a Unicode string and calls self.get_tokens_unprocessed.
  • get_tokens_unprocessed is defined by each lexer; in your case the relevant method is pygments.lexers.c_cpp.CFamilyLexer.get_tokens_unprocessed.
  • CFamilyLexer.get_tokens_unprocessed basically gets tokens from RegexLexer.get_tokens_unprocessed and reprocesses some of them.
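The call chain in the list above can be checked with a short sketch. It only uses the classes and functions named in the list; the sample string is the one from the question.

```python
import pygments
from pygments.lexer import RegexLexer
from pygments.lexers.c_cpp import CFamilyLexer, CLexer

code = "class foo{};\n"
lexer = CLexer()

# Inheritance chain: CLexer -> CFamilyLexer -> RegexLexer.
is_chain = issubclass(CLexer, CFamilyLexer) and issubclass(CFamilyLexer, RegexLexer)

# pygments.lex simply delegates to lexer.get_tokens.
via_lex = list(pygments.lex(code, lexer))
via_method = list(lexer.get_tokens(code))

# get_tokens_unprocessed yields (index, token type, value) triples.
unprocessed = list(lexer.get_tokens_unprocessed(code))

print(is_chain, via_lex == via_method, unprocessed[0])
```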

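Finally, RegexLexer.get_tokens_unprocessed walks through the text from a current position. At each position it loops over the rules of the current state (something like (("function", ('pattern-to-find-c-function',)), ("class", ('pattern-to-find-c-class',)))) and tries each rule in turn until one matches; it then emits the token, possibly changes state, and moves on to the next position.

The state that drives this loop is a stack kept in a local variable inside get_tokens_unprocessed, and it is never handed back to the caller. This is what makes what you want impossible with the stock lexers: nothing tells you which state was active at foo, so foobar{ cannot be re-lexed on its own.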
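To illustrate why the fragment foobar{ cannot simply be lexed in isolation, here is a toy state-stack lexer written with the standard library only. It is not Pygments' implementation: the RULES table, the state names and the lex function are all hypothetical and exist only to show how the result depends on the state the lexer was in.

```python
import re

# Toy rule table: state -> list of (regex, token type, state action).
# The state action is None, the name of a state to push, or "#pop".
RULES = {
    "root": [
        (re.compile(r"class\b"), "Keyword", "classname"),
        (re.compile(r"\s+"), "Text", None),
        (re.compile(r"[A-Za-z_]\w*"), "Name", None),
        (re.compile(r"[{};]"), "Punctuation", None),
    ],
    "classname": [
        (re.compile(r"\s+"), "Text", None),
        (re.compile(r"[A-Za-z_]\w*"), "Name.Class", "#pop"),
    ],
}


def lex(text, stack=("root",)):
    """Yield (pos, token type, value, stack after the token)."""
    stack = list(stack)
    pos = 0
    while pos < len(text):
        for regex, token, action in RULES[stack[-1]]:
            m = regex.match(text, pos)
            if m:
                if action == "#pop":
                    stack.pop()
                elif action is not None:
                    stack.append(action)
                yield pos, token, m.group(), tuple(stack)
                pos = m.end()
                break
        else:
            # No rule matched: emit one character as an error and move on.
            yield pos, "Error", text[pos], tuple(stack)
            pos += 1


full = list(lex("class foo{};"))
saved = full[1][3]  # stack in effect just before 'foo'

resumed = list(lex("foobar{", stack=saved))  # resumes in the 'classname' state
from_root = list(lex("foobar{"))             # starts from 'root' instead

print(saved)
print([(t, v) for _, t, v, _ in resumed])
print([(t, v) for _, t, v, _ in from_root])
```

In this toy, foobar is classified as Name.Class only when lexing resumes from the saved stack; started from root, the same text yields a plain Name. The toy can resume only because it returns its stack with every token, which the stock Pygments lexers do not do.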


To make my point more obvious, I added two print lines to the library (file pygments/lexer.py, line 628):

for rexmatch, action, new_state in statetokens:
    print('looking for {}'.format(action))
    m = rexmatch(text, pos)
    print('found: {}'.format(m))

And ran it with this code:

import pygments
import pygments.lexers

lexer = pygments.lexers.get_lexer_for_filename("foo.h")
sample="""
class foo{};
"""
print(list(lexer.get_tokens(sample)))

Output:

[...]
looking for Token.Keyword.Reserved
found: None
looking for Token.Name.Builtin
found: None
looking for <function bygroups.<locals>.callback at 0x7fb1f29b52f0>
found: None
looking for Token.Name
found: <_sre.SRE_Match object; span=(6, 9), match='foo'>
[...]

As you can see, the code iterates over the rules (and their token types), trying each one until something matches.


Given that, and the fact (as Tarun Lalwani said in the comments) that a single new character can break the whole source-code structure, you cannot do better than re-lexing the whole text on each update.

Arount answered Oct 18 '22 00:10
