Data structure for an ordered set with many defined subsets; retrieve subsets in same order

Question

I'm looking for an efficient way of storing an ordered list/set of items where:

The order of items in the master set changes rapidly (subsets maintain the master set's order)
Many subsets can be defined and retrieved
The number of members in the master set grow rapidly
Members are added to and removed from subsets frequently
Must allow for somewhat efficient merging of any number of subsets

Performance would ideally be biased toward retrieval of the first N items of any subset (or merged subset), and storage would be in-memory (and maybe eventually persistent on disk)

Apoorv · Accepted Answer

I am a new member to this forum, I hope you have not forgotten about this old question :)

Solution

Store the master set in an indexed data structure --such as an array (or arraylist, if your library supports it). Let's assume you can associate an id with each set (if not, then how do you know which set to retrieve?). So, we now need a way to find out which elements of your array participate in that set and which ones don't.

Use a matrix (n x m) with n being the number of elements in your array and m being the initial number of sets. i refers to the row index and j refers to the column index.

A[i][j] = 0 if ith element is not in jth set
A[i][j] = 1 if ith element is in jth set

Do not use a simple two dimensional array, go for an ArrayList<ArrayList>. Java/C#/C++ support such generic constructs, but it shouldn't be terribly hard to do so in other languages such as Perl. In C# you can even use a DataTable.

Time to add a new set

You can add a new set in O(n) time. Simply add a new column for that set and set the appropriate rows to 1 for that column. There will be no need to sort this set as long as the original array is sorted.

Time to add a new element

In a simple sorted array, time for insertion is O(log n). In our case, we will first add element to the array (and at whatever index we added the element at, the matrix will also get an all 0 row at that index). Then we set entries in that column to 1 if the element belongs to a set. That way, the worst case runtime becomes O(log n) + O(m).

Time to fetch first N elements from a set

Pick up the column corresponding to the set in O(1) time and then pick the first N entries that are 1. This will be linear.

Time to merge two sets

Let's say we are merging sets at j1 and j2 into a third set at j3.

for (int i = 0; i < n - 1; i++) {
    A[i][j3] = A[i][j1] | A[i][j2];
}

This is again linear.

Time to remove an element

First find the element in the master array --this takes O(log n) time. Then remove it from that array and remove row at that index from the matrix.

Efficient deletions from the array

Don't simply remove, just mark them defunct. Upon a threshold number of defunct'd columns/rows, you can consolidate. Similarly, start with a high capacity initially for the arrays. Modern implementations should do this automatically though.

Data structure for an ordered set with many defined subsets; retrieve subsets in same order

Tags:

sorting

data-structures

set

html-lists

Aaron

1 Answers

Solution

Time to add a new set

Time to add a new element

Time to fetch first N elements from a set

Time to merge two sets

Time to remove an element

Efficient deletions from the array

Apoorv

Recent Activity

Donate For Us

Data structure for an ordered set with many defined subsets; retrieve subsets in same order

Tags:

sorting

data-structures

set

html-lists

Aaron

1 Answers

Solution

Time to add a new set

Time to add a new element

Time to fetch first N elements from a set

Time to merge two sets

Time to remove an element

Efficient deletions from the array

Apoorv

Related questions

Recent Activity

Donate For Us