Best way to save lots of position pairs in Emacs Lisp

Question

I am in the middle of transforming a usual full-restart tokenizer (ported from the original language parser, the language is irrelevant here) into a more advanced incremental one. This implies the following:

a) it has to be fast, very fast;

b) upon every text update (be it insertion or deletion) it has to find damaged tokens and repair the token list accordingly.

The original tokenizer version just builds a list of tokens while going through the buffer text using regexps; every token in the list is a vector of 4 elements (['TOKENTYPE "token-lexeme" linum charnum]). Linum/charnum are plain numbers specifying token position in the buffer at the moment lexing was done. Easy pie.

Now, to the point. Whenever (well.. not every time, but often enough) a user adds or deletes a character (or a string) the new tokenizer has to find a token built using the text in a change position, and, probably, dependant tokens for later deletions/updates.

There are two problems here:

a) token positions should be dynamic (i.e. if a user adds some text in the beginning of a buffer -> we shouldn't bother fixing tokens in the buffer end);

b) a way to get to a damaged token(or lots of tokens in general).

Right now I am trying to use overlays for the task as overlays have a nice interface suiting my needs: overlays-at/overlays-in functions help the search; and overlay start/end move in an apropriate way.

I can happily do that for a smaller file. But it turns out (and I have to admit that I was warned by the docs) that the solution does not scale: even an average 1K LOC file can have CONST * LOC overlays, which is just too much for Emacs.

That was my first try, and it wasn't a successful one. I am considering alternatives such as:

1) managing a handwritten token search tree using plain numbers;

2) same tree, but using markers;

3) some kind of a mixed approach including both plain numbers and markers.

Any alternatives to mentioned approaches? Or maybe there's a better way to handle lots of overlays?

Stefan · Accepted Answer

Like Lindydancer, I'd suggest you use text-properties rather than overlays. Overlays scale like O(N^2) whereas text-properties scale like O(N log N), so it works much better. I wouldn't use font-lock for any of it, tho.

Of course, an alternative solution is to fix overlays: the C code could be changed to make it O(N log N). I've known how to do that for a while now, but haven't found the time (and am unlikely to find the time in any foreseeable future), so if someone's interested I'd be very happy to help him do it.

Lindydancer · Answer

An alternative to overlays are text properties, they are attached to the text in a way that overlays aren't, so they are much more efficient.

A package that use text properties heavily is font-lock. Normally, it's used for highlighting buffers, but there is nothing that prevents you from piggy-backing on it for your own purposes. That way you will get the entire system for detecting that the user has modified the content of the buffer for free.

In your case, you could replace the regexp in a font-lock keyword with a function that will be called with a search limit. All you would need to do is to scan the relative short section, set your own text properties and you are done. (Also, you must inform font-lock which property you are setting using font-lock-extra-managed-props.)

Best way to save lots of position pairs in Emacs Lisp

Tags:

emacs

tokenize

elisp

overlays

Vlad

2 Answers

Stefan

Lindydancer

Recent Activity

Donate For Us

Best way to save lots of position pairs in Emacs Lisp

Tags:

emacs

tokenize

elisp

overlays

Vlad

2 Answers

Stefan

Lindydancer

Related questions

Recent Activity

Donate For Us