Best way to represent language tokens for an autocompletion scenario

Question

As some of you know, I'm developing my own IDE. You might think "oh no, another one?!" - don't worry, no one's forcing you to use it, and I doubt it will be seriously published anyway.

So, onwards to the main issue. I'm trying to implement an autocompletion system. The exact UI is not the concern; my main problem is storing language/library tokens in a flexible way.

Let's say we're suggesting CSS selectors OR attributes to the user. We'd have something like:

- css/core
  - a                      // anchor tag
  - etc                    // all valid html tags
  - .stuff                 // class name parsed from user project
  - etc                    // more stuff parsed from user project (ids, classes...)
- css/properties
  - border                 // regular css properties - we also need to associate
                           // <border-style> and <color> value tokens
  - etc                    // the rest of them
- css/values/border-style  // property value tokens
  - solid
  - dotted
- css/values/color
  - red
  - green
  - fuchsia

So each token gets a namespace, so we can track the relationships between tokens. Similarly to BNF, some token values are made up of subtokens, as is the case for border, whose value combines border-style and color tokens.
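One possible way to hold the tree above in memory is sketched below in Python. The names `Token`, `TokenStore` and `values_for` are made up for illustration and are not from any existing IDE; the idea is only that each namespace maps token text to a token, and a token can point at the namespaces that supply its values.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    text: str                     # what gets suggested, e.g. "border"
    value_namespaces: tuple = ()  # namespaces whose tokens may follow it


@dataclass
class TokenStore:
    # namespace -> {token text -> Token}; keying by text prevents duplicates
    namespaces: dict = field(default_factory=dict)

    def add(self, namespace, text, value_namespaces=()):
        bucket = self.namespaces.setdefault(namespace, {})
        bucket[text] = Token(text, tuple(value_namespaces))

    def tokens(self, namespace):
        return sorted(self.namespaces.get(namespace, {}))

    def values_for(self, namespace, text):
        """Expand a token's value namespaces into concrete suggestions."""
        token = self.namespaces[namespace][text]
        suggestions = []
        for value_namespace in token.value_namespaces:
            suggestions.extend(self.tokens(value_namespace))
        return suggestions


store = TokenStore()
store.add("css/properties", "border",
          ["css/values/border-style", "css/values/color"])
for style in ("solid", "dotted"):
    store.add("css/values/border-style", style)
for color in ("red", "green", "fuchsia"):
    store.add("css/values/color", color)

border_values = store.values_for("css/properties", "border")
```

Here `border_values` ends up holding the border-style tokens followed by the color tokens, which is roughly the BNF-like expansion described above.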

1. Don't forget that we need to store anything that might relate to languages with exotic syntax.
2. Also, it is important to note that I will need to somehow merge the above information with context-dependent information, such as a list of class names gathered from the project's files. This should be fast and efficient, and should not produce duplicate tokens.
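To make the merge requirement concrete, here is a rough Python sketch. It assumes project files are available as plain strings and uses a deliberately naive regular expression (it ignores comments, strings and at-rules), so treat `scan_selectors` and `merged_tokens` as hypothetical helpers rather than a real CSS parser. Duplicates disappear because everything goes through a set.

```python
import re

# Assumed pattern: "." or "#" followed by a CSS-style identifier.
SELECTOR_RE = re.compile(r"[.#]-?[A-Za-z_][\w-]*")


def scan_selectors(css_text):
    """Collect class and id selectors from the text in front of each '{'."""
    found = set()
    for selector_part in re.findall(r"([^{}]*)\{", css_text):
        found.update(SELECTOR_RE.findall(selector_part))
    return found


def merged_tokens(static_tokens, project_files):
    """Union of built-in tokens and tokens scanned from project files."""
    merged = set(static_tokens)
    for text in project_files.values():
        merged |= scan_selectors(text)
    return sorted(merged)


project_files = {
    "a.css": ".stuff { color: red; } #main, .stuff a { border: 1px solid #fff; }",
    "b.css": ".stuff { margin: 0.5em; }",
}
core_tokens = merged_tokens(["a", "div"], project_files)
```

Only the selector part of each rule is scanned, so a color value such as `#fff` inside a declaration is not mistaken for an id.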

So, to conclude, the problem is very complicated, and I honestly can't think of a general and flexible solution. Keep in mind that the IDE should cater for any kind of language, which makes this even more complicated.

I'm not sure whether this question is better suited to, for example, Programmers, so I'll leave it up to the mods to decide.

Martin Konicek · Accepted Answer

I worked on an IDE called SharpDevelop. Let me start with a more general discussion before I get to the storage question.

I don't think you can solve autocompletion properly in a generic way. Most IDEs support various languages by having a plugin for each of the languages and it is entirely up to the plugin to figure out what a completion list should look like based on the current position of the cursor in the document.

The IDE only provides a simple interface that the plugins implement. For example, the code in the IDE showing autocompletion could look like this:

function getAutocompletionList(editor) {
  // Delegate to the plugin for the language being edited.
  const plugin = editor.languagePlugin;
  return plugin.getAutocompletionList(editor.cursorPosition, editor.parsedDocument);
}

A CSSLanguagePlugin and PHPLanguagePlugin would then have completely separate implementations of getAutocompletionList - one would be used when editing CSS, the other when editing PHP.
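As a small illustration of that split, the following Python sketch shows one way the IDE core could pick a plugin. Everything here is an assumption made for the example: the class names, the lookup by file extension, and the placeholder suggestion lists. It is not how SharpDevelop does it.

```python
import os
from abc import ABC, abstractmethod


class LanguagePlugin(ABC):
    """Interface the IDE core talks to; one subclass per language."""

    @abstractmethod
    def get_autocompletion_list(self, cursor_position, parsed_document):
        """Return a list of suggestion strings for the cursor position."""


class CssLanguagePlugin(LanguagePlugin):
    def get_autocompletion_list(self, cursor_position, parsed_document):
        return ["border", "color"]  # placeholder data, not a real analysis


class NoCompletionPlugin(LanguagePlugin):
    """Fallback for file types without a dedicated plugin."""

    def get_autocompletion_list(self, cursor_position, parsed_document):
        return []


PLUGINS_BY_EXTENSION = {".css": CssLanguagePlugin()}
FALLBACK_PLUGIN = NoCompletionPlugin()


def plugin_for(filename):
    extension = os.path.splitext(filename)[1].lower()
    return PLUGINS_BY_EXTENSION.get(extension, FALLBACK_PLUGIN)
```

The core never needs to know anything about CSS or PHP; adding a language means registering one more object in `PLUGINS_BY_EXTENSION`.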

As others pointed out, the context around the cursor is important. For example, when editing the following CSS:

h1 {
    text-align: <cursor>

The contexts would be:

[cssTopLevelContext] {
    [cssPropertyContext]: [cssPropertyValueContext]
}

The implementation of the CSS plugin would then do the following:

class CSSLanguagePlugin {
    getAutocompletionList(cursorPosition, document) {
        const completionContext = this.getCompletionContext(cursorPosition, document);
        // completionContext is {
        //     'name': 'cssPropertyValueContext',
        //     'propertyName': 'text-align'
        // }
        // For that context the database returns e.g. ['left', 'center', 'right'].
        return this.completionDatabase.getCompletionList(completionContext);
    }
}
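For what it's worth, a very rough version of the context detection could look like the Python sketch below. It assumes the document is a plain string and the cursor is a character offset, and it only looks at braces, semicolons and colons before the cursor, so comments, strings and nested at-rules are not handled. The `VALUES_BY_PROPERTY` table is placeholder data.

```python
VALUES_BY_PROPERTY = {"text-align": ["left", "center", "right"]}  # placeholder


def get_completion_context(cursor_position, text):
    before = text[:cursor_position]
    open_brace = before.rfind("{")
    close_brace = before.rfind("}")
    if open_brace <= close_brace:
        # Last block is closed, or no block was opened yet (-1 <= -1).
        return {"name": "cssTopLevelContext"}
    # The current declaration starts after the last ';' or the opening '{'.
    start = max(open_brace, before.rfind(";")) + 1
    declaration = before[start:]
    if ":" in declaration:
        property_name = declaration.split(":", 1)[0].strip()
        return {"name": "cssPropertyValueContext", "propertyName": property_name}
    return {"name": "cssPropertyContext"}


def get_completion_list(context):
    if context["name"] == "cssPropertyValueContext":
        return VALUES_BY_PROPERTY.get(context["propertyName"], [])
    return []


document = "h1 {\n    text-align: "
context = get_completion_context(len(document), document)
suggestions = get_completion_list(context)
```

A real plugin would more likely work from a parsed document than from raw text, but the shape of the result is the same.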

Now we get to your question - the completion database. Again, it could (and probably should) be a different implementation for different language plugins - in PHP you work with classes, methods and variables, and have to care about visibility (private, public, protected). In CSS you work with tags, classes and properties.

As you correctly pointed out, the completion database should consist of:

- common tokens
- tokens imported by the current project
- tokens in the current project itself

In SharpDevelop, the 'common tokens' part is not there, as any project imports the standard library, so it is enough to analyze all the imported libraries when opening a project.

In PHP you could do the same and you could cache the token database for already seen libraries.
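The caching idea might be sketched like this in Python. The `analyze` callback and the `(name, version)` cache key are assumptions for the example; how a library is actually identified and analyzed is up to the plugin.

```python
class LibraryTokenCache:
    """Remembers the tokens of libraries that were already analyzed."""

    def __init__(self, analyze):
        self._analyze = analyze  # hypothetical callback: (name, version) -> tokens
        self._cache = {}

    def tokens_for(self, name, version):
        key = (name, version)
        if key not in self._cache:
            # Expensive step: runs only the first time a library is seen.
            self._cache[key] = frozenset(self._analyze(name, version))
        return self._cache[key]


analysis_calls = []


def fake_analyze(name, version):
    analysis_calls.append((name, version))
    return ["strlen", "strpos"]  # placeholder result


cache = LibraryTokenCache(fake_analyze)
first = cache.tokens_for("php-core", "8.0")
second = cache.tokens_for("php-core", "8.0")
```

The second lookup is served from memory, so opening another project that imports the same library costs nothing extra.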

Now we get to the storage format. To offer autocompletion in PHP, you will need to know the current class, its base class and interface hierarchy, methods in all the base classes and interfaces and their visibility, variables visible in current context and their types (not always possible in PHP) and so on.

For this reason, I think a relational database is not a good choice. How will you store all the classes, interfaces and methods there and navigate the inheritance hierarchy? SharpDevelop stores all this in memory as an object model (a Class has a base type, a list of interfaces, a list of members etc.). 8000 items (roughly the number of tokens in the PHP standard library) is not a very large number, and if you stored 8000 items in a relational database, it would be so small that the database engine would keep it all in RAM anyway.
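To show why an object model is convenient here, below is a minimal Python sketch of such a model with a walk up the inheritance chain. The class and field names are invented for the example and the visibility rules are simplified (interfaces are stored but not traversed); SharpDevelop's real model is considerably richer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Member:
    name: str
    visibility: str = "public"  # "public", "protected" or "private"


@dataclass
class ClassInfo:
    name: str
    base: Optional["ClassInfo"] = None
    interfaces: list = field(default_factory=list)  # kept for completeness
    members: list = field(default_factory=list)


def accessible_members(cls, inside_class=False):
    """Names to suggest on an instance of cls.

    inside_class=True models completion inside cls's own methods, where
    protected members and cls's own private members are also visible.
    """
    names, seen = [], set()
    current = cls
    while current is not None:
        for member in current.members:
            if member.name in seen:
                continue  # already provided by a more derived class
            seen.add(member.name)
            if member.visibility == "public":
                visible = True
            elif member.visibility == "protected":
                visible = inside_class
            else:  # private: only from inside the declaring class itself
                visible = inside_class and current is cls
            if visible:
                names.append(member.name)
        current = current.base
    return names


shape = ClassInfo("Shape", members=[
    Member("area"), Member("resize", "protected"), Member("cache", "private")])
circle = ClassInfo("Circle", base=shape, members=[
    Member("radius"), Member("validate", "private")])
```

Following `base` pointers is a plain loop here, whereas in a relational schema the same walk would need a join or query per level.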

SharpDevelop keeps all the completion information in memory, and when you open a 700K line project in SharpDevelop, the memory consumption is still pretty low. I suggest you initialize your autocompletion data structures when a project is opened and keep them in memory. As others said, you have to update them in the background as the user types (introducing new methods, renaming fields, etc.).
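One way to keep such updates cheap is to index tokens per file and re-scan only the file that changed. The Python sketch below assumes a crude identifier regex instead of a real parser, and `ProjectIndex` is a hypothetical name; scheduling the call on a background thread is left out.

```python
import re
from collections import Counter

IDENTIFIER_RE = re.compile(r"[A-Za-z_]\w*")


class ProjectIndex:
    """Tracks which identifiers occur in which files of the project."""

    def __init__(self):
        self._tokens_by_file = {}
        self._file_counts = Counter()  # token -> number of files containing it

    def update_file(self, path, text):
        """Re-index one file; call this whenever that file's content changes."""
        new_tokens = set(IDENTIFIER_RE.findall(text))
        old_tokens = self._tokens_by_file.get(path, set())
        for token in old_tokens - new_tokens:
            self._file_counts[token] -= 1
            if self._file_counts[token] == 0:
                del self._file_counts[token]  # no file mentions it any more
        for token in new_tokens - old_tokens:
            self._file_counts[token] += 1
        self._tokens_by_file[path] = new_tokens

    def tokens(self):
        return sorted(self._file_counts)


index = ProjectIndex()
index.update_file("a.php", "function oldName() {}")
index.update_file("b.php", "oldName(); helper();")
index.update_file("a.php", "function newName() {}")  # user renamed the function
```

After the rename, `oldName` is still offered because `b.php` still mentions it; it drops out once that file is edited too.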

So that's it for PHP. For CSS, a data structure similar to what you outlined in your question seems very reasonable. You could load this into memory from a structured file upon the start of the IDE / opening a project / opening the first CSS file.
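As an example of what such a structured file could contain, here is a Python sketch that uses JSON. The file layout is an assumption made for illustration: each namespace maps token text to the list of namespaces that provide its values. The loader only checks that those references resolve.

```python
import json

# Stand-in for the contents of a hypothetical css-tokens.json file.
CSS_TOKENS_JSON = """
{
  "css/properties": {"border": ["css/values/border-style", "css/values/color"]},
  "css/values/border-style": {"solid": [], "dotted": []},
  "css/values/color": {"red": [], "green": [], "fuchsia": []}
}
"""


def load_token_file(json_text):
    """Parse the token data and verify every referenced namespace exists."""
    data = json.loads(json_text)
    for namespace, tokens in data.items():
        for token, value_namespaces in tokens.items():
            for referenced in value_namespaces:
                if referenced not in data:
                    raise ValueError(
                        f"{namespace}/{token} refers to unknown namespace {referenced!r}")
    return data


css_tokens = load_token_file(CSS_TOKENS_JSON)
```

Reading from disk would use `json.load` on an open file instead of `json.loads` on a string; the validation step stays the same.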

As an end note, implementing good autocompletion for CSS shouldn't be that hard. For PHP, it will be much more difficult, and you could start with something simple: offering the 8000 tokens from the standard library plus words that the user typed somewhere else in the project. Such an approach is used by editors like Sublime Text and works surprisingly well.
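That simple starting point fits in a few lines. The Python sketch below is an assumption-laden illustration (function names, the identifier regex and the sample tokens are all made up): it merges a standard token list with every identifier-like word found in the project and answers prefix queries with a binary search over the sorted list.

```python
import bisect
import re

WORD_RE = re.compile(r"[A-Za-z_]\w*")


def build_word_list(standard_tokens, project_texts):
    """Sorted, duplicate-free list of standard tokens plus project words."""
    words = set(standard_tokens)
    for text in project_texts:
        words.update(WORD_RE.findall(text))
    return sorted(words)


def complete(prefix, sorted_words, limit=10):
    """Return up to `limit` words starting with `prefix` (case-sensitive)."""
    results = []
    position = bisect.bisect_left(sorted_words, prefix)
    while position < len(sorted_words) and len(results) < limit:
        word = sorted_words[position]
        if not word.startswith(prefix):
            break  # sorted order: no later word can match either
        results.append(word)
        position += 1
    return results


words = build_word_list(
    ["strlen", "strpos", "str_replace", "array_map"],
    ["function strangeHelper($x) { return strlen($x); }"],
)
matches = complete("str", words)
```

There is no notion of scope or type here, so the list will contain irrelevant entries, but it is cheap to build and already useful.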

Tags: language-agnostic, autocomplete, token

Christian
