Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure for set of (non-disjoint) sets

I'm looking for a data structure that roughly corresponds to (in Java terms) Map<Set<int>, double>. Essentially a set of sets of labeled marbles, where each set of marbles is associated with a scalar. I want it to be able to efficiently handle the following operations:

  • Add a given integer to every set.
  • Remove every set that contains (or does not contain) a given integer, or at least set the associated double to 0.
  • Union two of the maps, adding together the doubles for sets that appear in both.
  • Multiply all of the doubles by a given double.
  • Rarely, iterate over the entire map.

under the following conditions:

  • The integers will fall within a constrained range (between 1 and 10,000 or so); the exact range will be known at compile-time.
  • Most of the integers within the range (80-90%) will never be used, but which ones will not be easily determinable until the end of the calculation.
    • The number of integers used will almost always still be over 100.
  • Many of the sets will be very similar, differing only by a few elements.
  • It may be possible to identify certain groups of integers that frequently appear only in sequential order: for example, if a set contains the integers 27 and 29 then it (almost?) certainly contains 28 as well.
    • It may be possible to identify these groups prior to running the calculation.
    • These groups would typically have 100 or so integers.

I've considered tries, but I don't see a good way to handle the "remove every set that contains a given integer" operation.

The purpose of this data structure would be to represent discrete random variables and permit addition, multiplication, and scalar multiplication operations on them. Each of these discrete random variables would ultimately have been created by applying these operations to a fixed (at compile-time) set of independent Bernoulli random variables (i.e. each takes the value 1 or 0 with some probability).

The systems being modeled are close to being representable as a time-inhomogeneous Markov chains (which would of course simplify this immensely) but, unfortunately, it is essential to track the duration since various transitions.

like image 663
Alex Godofsky Avatar asked Apr 02 '14 23:04

Alex Godofsky


People also ask

Which data structure is used in disjoint-set?

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets.

What is non disjoint-set?

A pair of sets which does not have any common element are called disjoint sets. For example, set A={2,3} and set B={4,5} are disjoint sets. But set C={3,4,5} and {3,6,7} are not disjoint as both the sets C and D are having 3 as a common element.

What are 2 ways to implement disjoint sets?

The disjoint set data structure supports following operations: Adding new sets to the disjoint set. Merging disjoint sets to a single disjoint set using Union operation.

How can you tell if the pair of sets is disjoint or not?

Two sets are said to be disjoint when they have no common element. If a collection has two or more sets, the condition of disjointness will be the intersection of the entire collection should be empty.


1 Answers

Here's a data structure, that can do all of your operations pretty efficiently:

I'm going to refer to it as a BitmapArray for this explanation.

Thinking about it, apparently for just the operations you have described a sorted array with bitmaps as keys and weights(your doubles) as values will be pretty efficient.

The bitmaps are what maintain membership in your set. Since you said the range of integers in the set are between 1-10,000, we can maintain information about any set with a bitmap of length 10,000.

It's gonna be tough sorting an array where the keys can be as big as 2^10000, but you can be smart about implementing the comparison function in the following way:

  • Iterate from left to right on the two bitmaps
  • XOR the bits on each index
  • Say you get a 1 at ith position
  • Whichever bitmap has 1 at ith position is greater
  • If you never get a 1 they're equal

I know this is still a slow comparison. But not too slow, Here's a benchmark fiddle I did on bitmaps with length 10000. This is in Javascript, if you're going to write in Java, it's going to perform even better.

    function runTest() {
    var num = document.getElementById("txtValue").value;
    num = isNaN(num * 1) ? 0 : num * 1;

    /*For integers in the range 1-10,000 the worst case for comparison are any equal integers which will cause the comparision to iterate over the whole BitArray*/
    bitmap1 = convertToBitmap(10000, num);
    bitmap2 = convertToBitmap(10000, num);

    before = new Date().getMilliseconds();
    var result = firstIsGreater(bitmap1, bitmap2, 10000);
    after = new Date().getMilliseconds();
    alert(result + " in time: " + (after-before) + " ms");

}


function convertToBitmap(size, number) {
    var bits = new Array();
    var q = number;
    do {
        bits.push(q % 2);
        q = Math.floor(q / 2);
    } while (q > 0);


    xbitArray = new Array();
    for (var i = 0; i < size; i++) {
        xbitArray.push(0);
    }

    var j = xbitArray.length - 1;
    for (var i = bits.length - 1; i >= 0; i--) {
        xbitArray[j] = bits[i];
        j--
    }
    return xbitArray;
}

function firstIsGreater(bitArray1, bitArray2, lengthOfArrays) {
    for (var i = 0; i < lengthOfArrays; i++) {
        if (bitArray1[i] ^ bitArray2[i]) {
            if (bitArray1[i]) return true;
            else return false;
        }
    }
    return false;
}

document.getElementById("btnTest").onclick = function (e) {
    runTest();
};

Also, remember that you only have to do this once, when building your BitmapArray (or while taking unions) and then it's going to become pretty efficient for the operations you'd do most often:

Note: N is the length of the BitmapArray.

Add integer to every set: Worst/best case O(N) time. Flip a 0 to 1 in each bitmap.

Remove every set that contains a given integer: Worst case O(N) time.

  • For each bitmap check the bit that represents the given integer, if 1 mark it's index.
  • Compress the array by deleting all marked indices.

If you're okay with just setting the weights to 0 it'll be even more efficient. This also makes it very easy if you want to remove all sets that have any element in a given set.

Union of two maps: Worst case O(N1+N2) time. Just like merging two sorted arrays, except you have to be smart about comparisons once more.

Multiply all of the doubles by a given double: Worst/best case O(N) time. Iterate and multiply each value by the input double.

Iterate over the BitmapArray: Worst/best case O(1) time for next element.

like image 73
aa333 Avatar answered Oct 14 '22 05:10

aa333