What data structure to use here

Hashes provide an excellent mechanism to extract the value corresponding to a given key in almost O(1) time, but they never preserve the order in which the keys were inserted. So is there any data structure that can combine the best of an array and a hash, that is, return the value corresponding to a given key in O(1) time, as well as return the nth value inserted in O(1) time? The ordering should be maintained, i.e., if the hash is {a:1, b:2, c:3} and something like del hash[b] has been done, nth(2) should return {c:3}.

Examples:

hash = {};
hash[a] = 1;
hash[b] = 2;
hash[c] = 3;
nth(2); //should return 2
hash[d] = 4;
del hash[c];
nth(3); //should return 4, as 'd' has been shifted up

Using modules like TIE::Hash or similar won't do; the onus is on me to develop it from scratch!

asked Oct 19 '12 by SexyBeast

2 Answers

It depends on how much memory may be allocated for this data structure. For O(N) space there are several choices:

  • It's easy to get a data structure with O(1) time for each of these operations: "get value by key", "get nth value inserted", "insert" - but only when "delete" time is O(N). Just use a combination of a hash map and an array, as explained by ppeterka (see the sketch right after this list).
  • Less obvious, but still simple, is O(sqrt N) for "delete" and O(1) for all other operations.
  • A little bit more complicated is to "delete" in O(N^(1/4)), O(N^(1/6)), or, in the general case, in O(M * N^(1/M)) time.
  • It is most likely impossible to decrease "delete" time to O(log N) while retaining O(1) for the other operations. But it is possible if you agree to O(log N) time for every operation. Solutions based on a binary search tree or a skip list allow it. One option is an order statistics tree: augment every node of a binary search tree with a counter storing the number of elements in the sub-tree under that node, then use it to find the nth node. Another option is an indexable skiplist. One more option is the O(M * N^(1/M)) solution with M = log N.
  • And I don't think you can get O(1) "delete" without increasing the time for other operations even more.
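
To make the first option concrete, here is a minimal sketch (Python, with illustrative names of my own) of the hash map + array combination: get, nth and insert are O(1), while delete has to shift the array and fix up the stored positions, which is O(N).

class ArrayBackedMap:
    """Hash map + array: O(1) get/nth/insert, O(N) delete (illustrative sketch)."""

    def __init__(self):
        self.items = []    # insertion order: list of (key, value)
        self.pos = {}      # key -> index in self.items

    def insert(self, key, value):      # O(1)
        self.pos[key] = len(self.items)
        self.items.append((key, value))

    def get(self, key):                # O(1)
        return self.items[self.pos[key]][1]

    def nth(self, n):                  # O(1); n counted from 0 here
        return self.items[n]

    def delete(self, key):             # O(N): shift and re-number everything after it
        i = self.pos.pop(key)
        del self.items[i]
        for j in range(i, len(self.items)):
            self.pos[self.items[j][0]] = j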

If unlimited space is available, you can do every operation in O(1) time.


O(sqrt N) "delete"

You can use a combination of two data structures: one to find a value by key and one to find a value by its insertion order. The first is a hash map (mapping each key to both its value and its position in the other structure). The second is a tiered vector, which maps a position to both value and key.

A tiered vector is a relatively simple data structure; it can easily be developed from scratch. The main idea is to split the array into sqrt(N) smaller arrays, each of size sqrt(N). Each small array needs only O(sqrt N) time to shift values after a deletion. And since each small array is implemented as a circular buffer, small arrays can exchange a single element in O(1) time, which allows the "delete" operation to complete in O(sqrt N) time (one such exchange for each sub-array between the deleted value and the first/last sub-array). A tiered vector also allows insertion into the middle in O(sqrt N), but this problem does not require it, so we can just append a new element at the end in O(1) time. To access an element by its position, we determine the starting position of the circular buffer for the sub-array where the element is stored, then get the element from that circular buffer; this also takes O(1) time.

Since the hash map remembers a position in the tiered vector for each of its keys, it must be updated whenever an element in the tiered vector changes position (O(sqrt N) hash map updates for each "delete").
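
Here is a rough sketch of this combination (Python; class and field names are my own, and the fixed tier size stands in for sqrt(N), which a real implementation would grow and shrink). Each tier is a circular buffer, so pulling the first element of one tier into the previous one is O(1), and only O(sqrt N) entries change position per delete.

class TieredOrderedMap:
    """Sketch of the hash map + tiered vector idea: O(1) insert/get/nth,
    O(sqrt N) delete. The fixed tier size B stands in for sqrt(N)."""

    def __init__(self, tier_size=4):
        self.B = tier_size            # capacity of each circular buffer
        self.tiers = []               # each tier: {"buf": [...], "head": int, "count": int}
        self.pos = {}                 # key -> (tier index, physical slot in that tier)

    def _new_tier(self):
        return {"buf": [None] * self.B, "head": 0, "count": 0}

    def insert(self, key, value):     # O(1): append to the last tier
        if not self.tiers or self.tiers[-1]["count"] == self.B:
            self.tiers.append(self._new_tier())
        t, tier = len(self.tiers) - 1, self.tiers[-1]
        slot = (tier["head"] + tier["count"]) % self.B
        tier["buf"][slot] = (key, value)
        tier["count"] += 1
        self.pos[key] = (t, slot)

    def get(self, key):               # O(1): plain hash lookup
        t, slot = self.pos[key]
        return self.tiers[t]["buf"][slot][1]

    def nth(self, n):                 # O(1): every tier except the last is full
        t, off = divmod(n, self.B)
        tier = self.tiers[t]
        return tier["buf"][(tier["head"] + off) % self.B]   # (key, value)

    def delete(self, key):            # O(sqrt N) overall
        t, slot = self.pos.pop(key)
        tier = self.tiers[t]
        off = (slot - tier["head"]) % self.B
        # 1. close the gap inside tier t (at most B moves, each with one hash update)
        for i in range(off, tier["count"] - 1):
            src = (tier["head"] + i + 1) % self.B
            dst = (tier["head"] + i) % self.B
            tier["buf"][dst] = tier["buf"][src]
            self.pos[tier["buf"][dst][0]] = (t, dst)
        tier["count"] -= 1
        # 2. borrow the first element of every later tier: O(1) per tier
        for j in range(t + 1, len(self.tiers)):
            prev, cur = self.tiers[j - 1], self.tiers[j]
            k, v = cur["buf"][cur["head"]]
            cur["head"] = (cur["head"] + 1) % self.B
            cur["count"] -= 1
            dst = (prev["head"] + prev["count"]) % self.B
            prev["buf"][dst] = (k, v)
            prev["count"] += 1
            self.pos[k] = (j - 1, dst)
        if self.tiers and self.tiers[-1]["count"] == 0:
            self.tiers.pop()

m = TieredOrderedMap()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]:
    m.insert(k, v)
m.delete("c")
print(m.nth(2))    # ('d', 4): later entries shifted up, i.e. the question's 1-based nth(3)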


O(M*N^(1/M)) "delete"

To optimize the "delete" operation even more, you can use the approach described in this answer. It deletes elements lazily and uses a trie to adjust an element's position, taking deleted elements into account.


O(1) for every operation

You can use a combination of three data structures to do this. The first is a hash map (mapping key to both value and a position in the array). The second is an array, which maps position to both value and key. And the third is a bit set, with one bit for each element of the array.

"Insert" operation just adds one more element to the array's end and inserts it into hash map.

"Delete" operation just unsets corresponding bit in the bit set (which is initialized with every bit = 1). Also it deletes corresponding entry from hash map. (It does not move elements of array or bit set). If, after "delete" the bit set has more than some constant proportion of elements deleted (like 10%), the whole data structure should be re-created from scratch (this allows O(1) amortized time).

"Find by key" is trivial, only hash map is used here.

"Find by position" requires some pre-processing. Prepare a 2D array. One index is the position we search. Other index is current state of our data structure, the bit set, reinterpreted as an index. Calculate population count for each prefix of every possible bit set and store prefix length, indexed by both population count and the bit set itself. Having this 2D array ready, you can perform this operation by first indexing by position and current "state" in this 2D array, then by indexing in the array with values.

Time complexity for every operation is O(1) (for insert/delete it is O(1) amortized). Space complexity is O(N * 2^N).
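
A small-scale sketch of this scheme (Python; the names and the tiny size limit are mine, and the periodic rebuild that gives amortized O(1) is omitted). The precomputed table here stores, for every possible bit set, the array indices of its surviving elements in order, which is one way to realize the 2D lookup described above; it is only practical for very small N, precisely because of the 2^N factor.

N_MAX = 16   # the table has 2**N_MAX rows, so this sketch only works for small N

# select_table[mask][k] = array index of the k-th surviving element when the
# liveness bit set equals mask (precomputed once for every possible mask)
select_table = [[i for i in range(N_MAX) if mask >> i & 1] for mask in range(1 << N_MAX)]

class BitsetOrderedMap:
    """O(1)-per-operation sketch: hash map + append-only array + bit set.
    The rebuild step that keeps space linear in live elements is omitted."""

    def __init__(self):
        self.items = []    # append-only list of (key, value)
        self.pos = {}      # key -> index in self.items
        self.alive = 0     # bit set: bit i is 1 while items[i] is not deleted

    def insert(self, key, value):          # O(1)
        i = len(self.items)
        assert i < N_MAX, "sketch only supports N_MAX entries"
        self.items.append((key, value))
        self.pos[key] = i
        self.alive |= 1 << i

    def delete(self, key):                 # O(1): just clear the bit
        i = self.pos.pop(key)
        self.alive &= ~(1 << i)

    def get(self, key):                    # O(1): hash map only
        return self.items[self.pos[key]][1]

    def nth(self, n):                      # O(1): one table lookup, one array lookup
        return self.items[select_table[self.alive][n]]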

In practice, using the whole bit set to index an array limits the allowed value of N by the pointer size (usually 64); it is limited even more by available memory. To alleviate this, we can split both the array and the bit set into sub-arrays of size N/C, where C is some constant. Now we can use a much smaller 2D array to find the nth element in each sub-array. And to find the nth element in the whole structure, we need an additional structure that records the number of valid elements in each sub-array. This is a structure of constant size C, so every operation on it is also O(1). It may be implemented as an array, but it is better to use some logarithmic-time structure like an indexable skiplist. After this modification, time complexity for every operation is still O(1); space complexity is O(N * 2^(N/C)).

answered Sep 23 '22 by Evgeny Kluev


Now that the question is clear to me too (better late than never...), here are my proposals:

  • You could maintain two hashes: one keyed by the keys, and one keyed by the insert order. This, however, is very ugly and slow to maintain when deleting and inserting in between. It would give the same almost-O(1) time for accessing the elements both ways.
  • You could use a hash for the keys and maintain an array for the insert order. This one is a lot nicer; deleting is still not very fast, but I think still a lot quicker than with the two-hash approach. This also gives true O(1) access to the nth element.

At first I misunderstood the question and gave a solution with O(1) key lookup and O(n) lookup of the nth element:

In Java, there is the LinkedHashMap for this particular task.

I think, however, that if someone finds this page, this might not be totally useless, so I leave it here...

answered Sep 22 '22 by ppeterka