Data Structure for fast position lookup

Tags: language-agnostic, data-structures

I'm looking for a data structure that logically represents a sequence of elements keyed by unique ids (for simplicity, let's consider them to be strings, or at least hashable objects). Each element can appear only once, there are no gaps, and the first position is 0.

The following operations should be supported (demonstrated with single-letter strings):

insert(id, position) - add the element keyed by id into the sequence at offset position. Naturally, the position of each element later in the sequence is now incremented by one. Example: [S E L F].insert(H, 1) -> [S H E L F]
remove(position) - remove the element at offset position. Decrements the position of each element later in the sequence by one. Example: [S H E L F].remove(2) -> [S H L F]
lookup(id) - find the position of the element keyed by id. Example: [S H L F].lookup(H) -> 1

The naïve implementation would be either a linked list or an array. Both would give O(n) lookup, remove, and insert.

In practice, lookup is likely to be used the most, with insert and remove happening frequently enough that it would be nice for them not to be linear (linear insert and remove is what a simple combination of hashmap + array/list would get you).
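To make the baseline concrete, here is a minimal sketch of the hashmap + array combination mentioned above. The class name `NaiveIndex` and its helper `_reindex` are made up for illustration. Lookup is a single dictionary access, but every insert or remove has to renumber all later elements, which is the linear cost the question wants to avoid.

```python
class NaiveIndex:
    """Hypothetical baseline: a list for order plus a dict for id -> position."""

    def __init__(self):
        self.items = []  # ids in sequence order
        self.pos = {}    # id -> current position

    def _reindex(self, start):
        # Renumber every element from `start` onward: this is the O(n) part.
        for i in range(start, len(self.items)):
            self.pos[self.items[i]] = i

    def insert(self, id_, position):
        if id_ in self.pos:
            raise KeyError(f"duplicate id: {id_!r}")
        self.items.insert(position, id_)
        self._reindex(position)

    def remove(self, position):
        id_ = self.items.pop(position)
        del self.pos[id_]
        self._reindex(position)
        return id_

    def lookup(self, id_):
        return self.pos[id_]  # O(1) expected


seq = NaiveIndex()
for i, letter in enumerate("SELF"):
    seq.insert(letter, i)
seq.insert("H", 1)  # [S H E L F]
seq.remove(2)       # [S H L F]
```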

In a perfect world it would be O(1) lookup, O(log n) insert/remove, but I actually suspect that wouldn't work from a purely information-theoretic perspective (though I haven't tried it), so O(log n) lookup would still be nice.

asked Aug 18 '12 10:08

agnoster

1 Answer

A combination of a trie and a hash map allows O(log n) lookup, insert, and remove.

Each trie node contains an id, a counter of the valid elements in the subtree rooted at this node, and up to two child pointers. The bit string formed by the left (0) and right (1) turns taken while traversing the trie from its root to a given node is part of the value stored in the hash map for the corresponding id.

The remove operation marks the trie node as invalid and updates the counters of valid elements on the path from the deleted node to the root. It also deletes the corresponding hash map entry.

The insert operation uses the position parameter and the counters of valid elements in each trie node to find the new node's predecessor and successor. If the in-order traversal from predecessor to successor contains any deleted nodes, choose the one with the lowest rank and reuse it. Otherwise choose either the predecessor or the successor and add a new child node to it (a right child for the predecessor or a left child for the successor). Then update the counters of valid elements on the path from this node to the root and add the corresponding hash map entry.

The lookup operation gets the bit string from the hash map and uses it to walk from the trie root to the corresponding node, summing the counters of valid elements to the left of this path.
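The three operations can be sketched roughly as follows. This is a simplified illustration, not a complete implementation: the names `Node`, `PositionIndex`, and `paths` are invented here, deleted nodes are never reused, and no rebalancing is done, so a bad insertion order still degrades to O(n).

```python
class Node:
    def __init__(self, id_):
        self.id = id_
        self.valid = True   # False once the element has been removed
        self.count = 1      # valid elements in the subtree rooted here
        self.left = None
        self.right = None


def _count(node):
    return node.count if node else 0


class PositionIndex:
    """Hypothetical sketch: counted binary tree plus a hash map id -> bit string."""

    def __init__(self):
        self.root = None
        self.paths = {}  # id -> string of '0' (left) / '1' (right) turns

    def insert(self, id_, position):
        if id_ in self.paths:
            raise KeyError(f"duplicate id: {id_!r}")
        if not 0 <= position <= _count(self.root):
            raise IndexError(position)
        new = Node(id_)
        if self.root is None:
            self.root, self.paths[id_] = new, ""
            return
        node, bits = self.root, []
        while True:
            node.count += 1  # the new element ends up somewhere below this node
            left = _count(node.left)
            if position <= left:
                bits.append("0")
                if node.left is None:
                    node.left = new
                    break
                node = node.left
            else:
                position -= left + int(node.valid)
                bits.append("1")
                if node.right is None:
                    node.right = new
                    break
                node = node.right
        self.paths[id_] = "".join(bits)

    def remove(self, position):
        if not 0 <= position < _count(self.root):
            raise IndexError(position)
        node = self.root
        while True:
            node.count -= 1  # one valid element disappears below this node
            left = _count(node.left)
            if position < left:
                node = node.left
            elif node.valid and position == left:
                node.valid = False  # mark as deleted, keep the node in place
                del self.paths[node.id]
                return node.id
            else:
                position -= left + int(node.valid)
                node = node.right

    def lookup(self, id_):
        node, position = self.root, 0
        for bit in self.paths[id_]:
            if bit == "1":
                # Everything in the left subtree, plus this node, precedes us.
                position += _count(node.left) + int(node.valid)
                node = node.right
            else:
                node = node.left
        return position + _count(node.left)


seq = PositionIndex()
for i, letter in enumerate("SELF"):
    seq.insert(letter, i)
seq.insert("H", 1)      # [S H E L F]
removed = seq.remove(2)  # [S H L F]
```

Because removed nodes stay in the tree, the stored bit strings of the other elements remain valid after a removal; only the counters along one root-to-node path change.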

All this allows O(log n) expected time for each operation if the sequence of inserts and removes is random enough. If it is not, the worst-case complexity of each operation is O(n). To get back to O(log n) amortized complexity, watch the sparsity and balance factors of the tree: if there are too many deleted nodes, re-create a new perfectly balanced and dense tree; if the tree is too imbalanced, rebuild the most imbalanced subtree.
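The "perfectly balanced and dense" rebuild could look roughly like the sketch below. The function name `build_balanced` and the dictionary-based node layout are assumptions made for illustration. It takes the surviving ids in sequence order and returns a fresh tree together with the new bit strings for the hash map. When to trigger the rebuild (how many deleted nodes count as "too many") is left open here, as it is in the text.

```python
def build_balanced(ids, lo=0, hi=None, path="", paths=None):
    """Build a balanced counted tree over ids[lo:hi].

    Returns (root, paths), where paths maps each id to its '0'/'1' turn string.
    Hypothetical helper: nodes are plain dicts to keep the sketch short.
    """
    if paths is None:
        paths = {}
    if hi is None:
        hi = len(ids)
    if lo >= hi:
        return None, paths
    mid = (lo + hi) // 2  # the middle element becomes the subtree root
    left, _ = build_balanced(ids, lo, mid, path + "0", paths)
    right, _ = build_balanced(ids, mid + 1, hi, path + "1", paths)
    node = {
        "id": ids[mid],
        "valid": True,
        "count": hi - lo,  # every node in a fresh tree is valid
        "left": left,
        "right": right,
    }
    paths[ids[mid]] = path
    return node, paths


root, paths = build_balanced(list("SHLF"))
```

The rebuild touches every element once, so it costs O(n); the amortized bound relies on it being triggered only after a proportional number of inserts and removes.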

Instead of a hash map, it is possible to use a binary search tree or any other dictionary data structure. Instead of the bit string that identifies the path in the trie, the hash map may store a pointer to the corresponding trie node; lookup then has to walk from that node up to the root, so the nodes need parent pointers.

Another alternative to the trie in this data structure is an indexable skip list.

O(log N) time for each operation is acceptable, but not perfect. It is possible, as explained by Kevin, to use an algorithm with O(1) lookup complexity in exchange for higher complexity of the other operations: O(sqrt(N)). But this can be improved.

If you choose some number of memory accesses (M) for each lookup operation, the other operations can be done in O(M*N^(1/M)) time. The idea of such an algorithm is presented in this answer to a related question. The trie structure described there makes it easy to convert a position to an array index and back. Each non-empty element of this array contains an id, and the hash map maps each id back to its array index.

To make it possible to insert an element into this data structure, each block of contiguous array elements should be interleaved with some empty space. When one of the blocks exhausts all its available empty space, we should rebuild the smallest group of blocks, related to some element of the trie, that has more than 50% empty space. When the total amount of empty space is less than 50% or more than 75%, we should rebuild the whole structure.

This rebalancing scheme gives O(M*N^(1/M)) amortized complexity only for random and evenly distributed insertions and removals. The worst-case complexity (for example, if we always insert at the leftmost position) is much larger for M > 2. To guarantee O(M*N^(1/M)) in the worst case, we need to reserve more memory and change the rebalancing scheme so that it maintains an invariant like this: keep at least 50% empty space reserved for the whole structure, at least 75% for all data related to the top trie nodes, at least 87.5% for the next level of trie nodes, and so on.

With M=2, we have O(1) time for lookup and O(sqrt(N)) time for the other operations.
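The M=2 trade-off can be illustrated with a much simpler structure than the gapped array described above. The sketch below is a swapped-in simplification, not the answer's scheme: it keeps the sequence in a list of small blocks (plain Python lists, no reserved empty space, no merging of small blocks), and the names `Block`, `SqrtIndex`, `where`, and `max_block` are invented for illustration. Lookup reads two stored numbers and adds them; insert and remove touch one block plus the start offsets of the blocks after it, which is O(sqrt(N)) when the block size is kept near sqrt(N).

```python
class Block:
    def __init__(self, ids, start):
        self.ids = ids      # ids stored in this block, in order
        self.start = start  # position of ids[0] in the whole sequence


class SqrtIndex:
    """Hypothetical two-level sketch: O(1) lookup, roughly O(sqrt(N)) updates."""

    def __init__(self, max_block=64):
        self.max_block = max_block
        self.blocks = [Block([], 0)]
        self.where = {}  # id -> (block, offset inside that block)

    def _total(self):
        last = self.blocks[-1]
        return last.start + len(last.ids)

    def _refresh(self, i):
        # Recompute block starts from block i on; offsets only for blocks i and i+1,
        # the only ones whose contents can have changed.
        prev = self.blocks[i - 1] if i else None
        start = prev.start + len(prev.ids) if prev else 0
        for k in range(i, len(self.blocks)):
            block = self.blocks[k]
            block.start = start
            if k <= i + 1:
                for off, id_ in enumerate(block.ids):
                    self.where[id_] = (block, off)
            start += len(block.ids)

    def insert(self, id_, position):
        if id_ in self.where:
            raise KeyError(f"duplicate id: {id_!r}")
        if not 0 <= position <= self._total():
            raise IndexError(position)
        i = next(k for k, b in enumerate(self.blocks)
                 if position <= b.start + len(b.ids))
        block = self.blocks[i]
        block.ids.insert(position - block.start, id_)
        if len(block.ids) > self.max_block:  # split an overfull block in two
            half = len(block.ids) // 2
            self.blocks.insert(i + 1, Block(block.ids[half:], 0))
            del block.ids[half:]
        self._refresh(i)

    def remove(self, position):
        if not 0 <= position < self._total():
            raise IndexError(position)
        i = next(k for k, b in enumerate(self.blocks)
                 if position < b.start + len(b.ids))
        block = self.blocks[i]
        id_ = block.ids.pop(position - block.start)
        del self.where[id_]
        if not block.ids and len(self.blocks) > 1:
            del self.blocks[i]  # drop a block that became empty
        self._refresh(i)
        return id_

    def lookup(self, id_):
        block, offset = self.where[id_]
        return block.start + offset


seq = SqrtIndex(max_block=2)  # tiny blocks, just to exercise splitting
for i, letter in enumerate("SELF"):
    seq.insert(letter, i)
seq.insert("H", 1)      # [S H E L F]
removed = seq.remove(2)  # [S H L F]
order = [x for b in seq.blocks for x in b.ids]
```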

With M=log(N), the factor N^(1/M) becomes a constant (2 for a base-2 logarithm), so we have O(log(N)) time for every operation.

But in practice small values of M (like 2 to 5) are preferable. Lookup time may then be treated as O(1), and a typical insert or remove operation works with up to 5 relatively small contiguous blocks of memory, in a cache-friendly way and with good vectorization possibilities. This also limits memory requirements if we require good worst-case complexity.

answered Oct 20 '22 08:10

Evgeny Kluev

