Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure for an ordered set with many defined subsets; retrieve subsets in same order

I'm looking for an efficient way of storing an ordered list/set of items where:

  1. The order of items in the master set changes rapidly (subsets maintain the master set's order)
  2. Many subsets can be defined and retrieved
  3. The number of members in the master set grow rapidly
  4. Members are added to and removed from subsets frequently
  5. Must allow for somewhat efficient merging of any number of subsets

Performance would ideally be biased toward retrieval of the first N items of any subset (or merged subset), and storage would be in-memory (and maybe eventually persistent on disk)

like image 439
Aaron Avatar asked Nov 06 '22 04:11

Aaron


1 Answers

I am a new member to this forum, I hope you have not forgotten about this old question :)

Solution

Store the master set in an indexed data structure --such as an array (or arraylist, if your library supports it). Let's assume you can associate an id with each set (if not, then how do you know which set to retrieve?). So, we now need a way to find out which elements of your array participate in that set and which ones don't.

Use a matrix (n x m) with n being the number of elements in your array and m being the initial number of sets. i refers to the row index and j refers to the column index.

A[i][j] = 0 if ith element is not in jth set
A[i][j] = 1 if ith element is in jth set

Do not use a simple two dimensional array, go for an ArrayList<ArrayList>. Java/C#/C++ support such generic constructs, but it shouldn't be terribly hard to do so in other languages such as Perl. In C# you can even use a DataTable.

Time to add a new set

You can add a new set in O(n) time. Simply add a new column for that set and set the appropriate rows to 1 for that column. There will be no need to sort this set as long as the original array is sorted.

Time to add a new element

In a simple sorted array, time for insertion is O(log n). In our case, we will first add element to the array (and at whatever index we added the element at, the matrix will also get an all 0 row at that index). Then we set entries in that column to 1 if the element belongs to a set. That way, the worst case runtime becomes O(log n) + O(m).

Time to fetch first N elements from a set

Pick up the column corresponding to the set in O(1) time and then pick the first N entries that are 1. This will be linear.

Time to merge two sets

Let's say we are merging sets at j1 and j2 into a third set at j3.

for (int i = 0; i < n - 1; i++) {
    A[i][j3] = A[i][j1] | A[i][j2];
}

This is again linear.

Time to remove an element

First find the element in the master array --this takes O(log n) time. Then remove it from that array and remove row at that index from the matrix.

Efficient deletions from the array

Don't simply remove, just mark them defunct. Upon a threshold number of defunct'd columns/rows, you can consolidate. Similarly, start with a high capacity initially for the arrays. Modern implementations should do this automatically though.

like image 178
Apoorv Avatar answered Nov 09 '22 06:11

Apoorv