Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient data structure to add styles to text

I'm looking for the best data structure to add styles to a text (say in a text editor). The structure should allow the following operations:

  1. Quick lookup of all styles at absolute position X
  2. Quick insert of text at any position (styles after that position must be moved).
  3. Every position of the text must support an arbitrary number of styles (overlapping).

I've considered lists/arrays which contain text ranges but they don't allow quick insert without recalculating the positions of all styles after the insert point.

A tree structure with relative offsets supports #2 but the tree will degenerate fast when I add lots of styles to the text.

Any other options?

like image 556
Aaron Digulla Avatar asked Nov 15 '10 15:11

Aaron Digulla


1 Answers

I have never developped an editor, but how about this:

I believe it would be possible to expand the scheme that is used to store the text characters themeselves, depending of course on the details of your implementation (language, toolkits etc) and your performance and resource usage requirements.

Rather than use a separate data structure for the styles, I'd prefer having a reference that would accompany each character and point to an array or list with the applicable characters. Characters with the same set of styles could point to the same array or list, so that one could be shared.

Character insertions and deletions would not affect the styles themeselves, apart from changing the number of references to them, which could be handled with a bit of reference counting.

Depending on your programming language you could even compress things a bit more by pointing halfway into a list, although the additional bookkeeping for this might in fact make it more inefficient.

The main issue with this suggestion is the memory usage. In an ASCII editor written in C, bundling a pointer with each char would raise its effective memory usage from 1 byte to 12 bytes on a 64 bit system, due to struct alignment padding.

I would look about breaking the text into small variable size blocks that would allow you to efficiently compress the pointers. E.g. a 32-character block might look like this in C:

struct _BLK_ {
    unsigned char size;
    unsigned int styles;
    char content[];
}

The interesting part is the metadata processing on the variable part of the struct, which contains both the stored text and any style pointers. The size element would indicate the number of characters. The styles integer (hence the 32-character limit) would be seen as a set of 32 1-bit fields, with each one indicating whether a character has its own style pointer, or whether it should use the same style as the previous character. This way a 32-char block with a single style would only have the additional overhead of the size char, the styles mask and a single pointer, along with any padding bytes. Inserting and deleting characters into a small array like this should be quite fast.

As for the text storage itself, a tree sounds like a good idea. Perhaps a binary tree where each node value would be the sum of the children values, with the leaf nodes eventually pointing to text blocks with their size as their node value? The root node value would be the total size of the text, with each subtree ideally holding half of your text. You'd still have to auto-balance it, though, with sometimes having to merge half-empty text blocks.

And in case you missed it, I am no expert in trees :-)

EDIT:

Apparently what I suggested is a modified version of this data structure:

http://en.wikipedia.org/wiki/Rope_%28computer_science%29

as referenced in this post:

Data structure for text editor

EDIT 2:

Deletion in the proposed data structure should be relatively fast, as it would come down to byte shifting in an array and a few bitwise operations on the styles mask. Insertion is pretty much the same, unless a block fills up. It might make sense to reserve some space (i.e. some bits in the styles mask) within each block to allow for future insertions directly in the blocks, without having to alter the tree itself for relatively small amounts of new text.

Another advantage of bundling characters and styles in blocks like this is that its inherent data locality should allow for more efficient use of the CPU cache than other alternatives, thus improving the processing speed to some extent.

Much like any complex data structure, though, you'd probably need either profiling with representative test cases or an adaptive algorithm to determine the optimal parameters for its operation (block size, any reserved space etc).

like image 89
thkala Avatar answered Nov 22 '22 06:11

thkala