 

Efficient data structure for very long array of characters with insertions

What data structure and/or algorithm would be good for implementing an array of characters with insertion? The typical workload would be a loop of a few "random" reads followed by a "random" insert. The array would be huge, on the order of gigabytes.

Edit: In an extreme case the algorithm needs to be able to build up a gigabyte string with single-byte insertions efficiently.

Martin asked Jan 25 '14


1 Answer

Since this data structure has to handle various extreme cases (inserting either long strings or lots of single characters, representing a very long array, possibly under memory constraints), it is unlikely that any single well-known data structure will suit all of these needs.

You could consider these two data structures:

  1. Rope. It works well if you insert a moderate number of relatively long strings.
  2. Tiered vector. It handles lots of single-character (or short-string) insertions without trouble, needs very little additional memory, and allows very fast random reads. But its insertion time complexity is fairly poor: O(sqrt(n)) (which could be improved by allowing some more additional memory), and it is inferior to a rope when inserting long strings. (See the sketch after this list.)
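
Here is a minimal C++ sketch of the tiered-vector insertion path. The class name, the tiny block size and the use of std::deque per block are illustrative choices for this example, not part of the original structure: each block holds at most B characters, an insert shifts within one block and then hands single characters down the chain of blocks, which is what gives the O(sqrt(n)) bound when B is about sqrt(n).

    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <string>
    #include <vector>

    class TieredVector {
        static constexpr std::size_t B = 4;       // tiny block size so the demo is visible;
                                                  // a real block would be ~sqrt(n) bytes
        std::vector<std::deque<char>> blocks;

    public:
        TieredVector() { blocks.emplace_back(); }

        // Insert one character at logical position pos: O(B) shifting inside the
        // target block, then O(n / B) constant-time hand-offs between blocks,
        // i.e. O(sqrt(n)) overall when B is about sqrt(n).
        void insert(std::size_t pos, char c) {
            std::size_t b = 0;
            while (b + 1 < blocks.size() && pos > blocks[b].size()) {
                pos -= blocks[b].size();
                ++b;
            }
            blocks[b].insert(blocks[b].begin() + pos, c);
            // Cascade the overflow: the last char of each over-full block moves to
            // the front of the next block (O(1) per block thanks to std::deque).
            while (blocks[b].size() > B) {
                if (b + 1 == blocks.size()) blocks.emplace_back();
                blocks[b + 1].push_front(blocks[b].back());
                blocks[b].pop_back();
                ++b;
            }
        }

        // Random read; a real implementation would keep a prefix-size index so the
        // block lookup is O(1) or O(log n) instead of this linear scan.
        char read(std::size_t pos) const {
            std::size_t b = 0;
            while (pos >= blocks[b].size()) {
                pos -= blocks[b].size();
                ++b;
            }
            return blocks[b][pos];
        }

        std::string str() const {
            std::string s;
            for (const auto& blk : blocks) s.append(blk.begin(), blk.end());
            return s;
        }
    };

    int main() {
        TieredVector tv;
        std::string text = "helloworld";
        for (std::size_t i = 0; i < text.size(); ++i) tv.insert(i, text[i]);
        tv.insert(5, '_');
        std::cout << tv.str() << '\n';            // prints: hello_world
    }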

Or you could use a custom-made data structure based on either a rope or a tiered vector.

If you expect very many small insertions and want to avoid too many rope splits, you could use a tree of arrays: insert into the middle of an array while it is short, but once the array grows to some size limit, the next insert should split it, just as in a rope. Arrays referenced by tree nodes should be rather large (about 1000 bytes or more) to strike a balance between very expensive cache misses while accessing tree nodes (so the number of nodes that do not fit in cache should be minimized) and the comparatively cheaper memmoves inside the arrays.
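
A minimal sketch of this split-on-overflow idea, assuming a leaf limit near the ~1000-byte figure mentioned above. For brevity the leaves live in a flat std::list here, whereas a real implementation would hang them off a balanced tree (as in a rope) and keep subtree sizes in the nodes so locating the target leaf costs O(log n) instead of a linear scan:

    #include <cstddef>
    #include <list>
    #include <vector>

    struct LeafStore {
        static constexpr std::size_t LEAF_MAX = 1024;   // split threshold per leaf
        std::list<std::vector<char>> leaves;

        LeafStore() { leaves.emplace_back(); }

        void insert(std::size_t pos, char c) {
            auto it = leaves.begin();
            while (pos > it->size()) { pos -= it->size(); ++it; }   // find the leaf
            it->insert(it->begin() + pos, c);                       // memmove-style shift
            if (it->size() > LEAF_MAX) {                            // split, as in a rope
                std::vector<char> right(it->begin() + it->size() / 2, it->end());
                it->resize(it->size() / 2);
                leaves.insert(std::next(it), std::move(right));
            }
        }
    };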

The most appropriate memory allocation scheme for these arrays is as follows: when an array no longer fits into its allocated space, split it into two equal parts, allocate some fixed number of bytes (say 2000) for each half, and copy each half into the middle of its allocated space. While inserting characters near the end of such an array, move the tail characters to the right; when inserting near the beginning, move the preceding characters to the left. So the average length of a memmove is only 1/4 of the average array length. Two neighboring allocation spaces could share the unused bytes between them, so a split is needed only when the bytes of one chunk are about to overwrite used bytes of the other chunk.

This approach is simple but needs some additional space to allow arrays to grow. We could use a general-purpose allocator to get only the space actually used by the arrays (or allow very limited growth space), but that is much slower and most likely leads to memory fragmentation and an even larger amount of unused memory. A possibly better way to save some memory is to use several fixed allocation sizes (like 1500, 1700, 2000) and reserve some fixed number (determined experimentally) of chunks of each size. Another way to save memory is, instead of splitting one 2000-byte array into two 1000-byte arrays, to merge two adjacent arrays (say 2000+1600) and then split the result into three (1200+1200+1200).
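
A sketch of this centered-chunk layout (the struct name and field names are illustrative): the payload is copied to the middle of the chunk, an insert shifts whichever side is shorter, and the chunk splits into two re-centered chunks once either gap is exhausted (the caller checks full() and splits before inserting):

    #include <cstddef>
    #include <cstring>
    #include <utility>

    struct Chunk {
        static constexpr std::size_t CHUNK = 2000;   // fixed allocation size per chunk
        char buf[CHUNK];
        std::size_t lo, hi;                          // payload occupies buf[lo, hi)

        explicit Chunk(const char* src = "", std::size_t n = 0) {
            lo = (CHUNK - n) / 2;                    // copy the payload to the middle
            hi = lo + n;
            std::memcpy(buf + lo, src, n);
        }

        std::size_t size() const { return hi - lo; }
        bool full() const { return lo == 0 || hi == CHUNK; }

        // Insert one char at offset pos of the payload (requires !full()).
        // The shorter side is moved, so the average memmove is ~1/4 of the payload.
        void insert(std::size_t pos, char c) {
            if (pos < size() / 2) {                  // near the front: shift head left
                std::memmove(buf + lo - 1, buf + lo, pos);
                --lo;
                buf[lo + pos] = c;
            } else {                                 // near the back: shift tail right
                std::memmove(buf + lo + pos + 1, buf + lo + pos, size() - pos);
                buf[lo + pos] = c;
                ++hi;
            }
        }

        // Split into two chunks, each half re-centered in fresh space.
        std::pair<Chunk, Chunk> split() const {
            std::size_t mid = size() / 2;
            return { Chunk(buf + lo, mid), Chunk(buf + lo + mid, size() - mid) };
        }
    };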

You've mentioned "packing bits together" to reduce RAM usage. That is not impossible with such a data structure (if your data is compressible). In fact two compression algorithms may be used here without sacrificing too much performance: Huffman coding or LZ4. For Huffman coding you need a static frequency table (precomputed in advance). For reading you'll need to decode only about 1/4 of the average array size to get to the proper position, plus the length of the string to read. For insertion you'll need to decode the same 1/4 of the average array size, then move a bitstream of the same size, then encode the inserted string. With LZ4 there is no need to deal with bitstreams, only whole bytes are used; it is probably worth increasing the size of the arrays to get better compression.
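
As an illustration of the LZ4 variant, here is a hedged sketch (it assumes liblz4 is available; the function name and the leaf bookkeeping are invented for the example): an insertion decompresses just the affected leaf, splices the new bytes in with an ordinary memmove, and recompresses it, so the cost stays bounded by the leaf size rather than the whole gigabyte string.

    #include <lz4.h>          // assumes liblz4 (link with -llz4)
    #include <cstring>
    #include <string>
    #include <vector>

    // Insert `s` at offset `pos` of one compressed leaf and return the new
    // compressed leaf. rawSize is the uncompressed size of the leaf, which a
    // real structure would store next to the compressed data.
    std::vector<char> insert_compressed(const std::vector<char>& leaf,
                                        int rawSize, int pos, const std::string& s) {
        // 1. Decompress the leaf into a buffer large enough for the insertion.
        std::vector<char> raw(rawSize + s.size());
        LZ4_decompress_safe(leaf.data(), raw.data(), (int)leaf.size(), rawSize);

        // 2. Splice the new bytes in (an ordinary memmove, no bitstreams).
        std::memmove(raw.data() + pos + s.size(), raw.data() + pos, rawSize - pos);
        std::memcpy(raw.data() + pos, s.data(), s.size());

        // 3. Recompress the modified leaf.
        std::vector<char> out(LZ4_compressBound((int)raw.size()));
        int n = LZ4_compress_default(raw.data(), out.data(),
                                     (int)raw.size(), (int)out.size());
        out.resize(n);
        return out;
    }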

The tiered vector may be optimized to be more cache-friendly. Add about 100-200 bytes of reserved space to each block. With each insertion, memmove bytes into this space, and only when there is no room for the next memmove start exchanging data between blocks (not byte by byte as in the original data structure, but 100-200 bytes at once).
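
A small sketch of this slack idea (block and slack sizes are illustrative): ordinary insertions simply shift within the block's reserve, and only once the reserve is used up are SLACK bytes handed to the next block in one bulk move instead of one byte per insertion.

    #include <cstddef>
    #include <deque>
    #include <vector>

    constexpr std::size_t B = 1024, SLACK = 128;   // nominal block size + reserve

    void insert_with_slack(std::vector<std::deque<char>>& blocks,
                           std::size_t blockIdx, std::size_t pos, char c) {
        auto& blk = blocks[blockIdx];
        blk.insert(blk.begin() + pos, c);                   // shift into the slack
        if (blk.size() > B + SLACK) {                       // slack exhausted
            if (blockIdx + 1 == blocks.size()) blocks.emplace_back();
            auto& next = blocks[blockIdx + 1];
            next.insert(next.begin(), blk.end() - SLACK, blk.end());   // bulk hand-off
            blk.erase(blk.end() - SLACK, blk.end());
            // (If `next` in turn overflows, the same bulk hand-off cascades onward.)
        }
    }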

To improve the O(sqrt(n)) insertion time, consider the tiered vector as a special case of a trie with only 2 levels. We could add one more level. Then (after an insertion) we could stop the inter-block data exchange when we reach the end of a second-level block, allocate an additional first-level block, and place the extra bytes there. When one or several such additional blocks fill up, we could continue the data exchange at the third-level block (the root of the trie). In theory this may be extended to log(n) levels to guarantee O(log(n)) single-character inserts and reads. But in practice 3 or 4 levels is probably the better choice, giving O(n^(1/3)) or O(n^(1/4)) amortized insertion complexity.

Evgeny Kluev answered Sep 22 '22