Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

An efficient code to determine if a set is a subset of another set

I am looking for an efficient way to determine if a set is a subset of another set in Matlab or Mathematica.

Example: Set A = [1 2 3 4] Set B = [4 3] Set C = [3 4 1] Set D = [4 3 2 1]

The output should be: Set A

Sets B and C belong to set A because A contains all of their elements, therefore, they can be deleted (the order of elements in a set doesn't matter). Set D has the same elements as set A and since set A precedes set D, I would like to simply keep set A and delete set D.

So there are two essential rules: 1. Delete a set if it is a subset of another set 2. Delete a set if its elements are the same as those of a preceding set

My Matlab code is not very efficient at doing this - it mostly consists of nested loops.

Suggestions are very welcome!

Additional explanation: the issue is that with a large number of sets there will be a very large number of pairwise comparisons.

like image 339
Eduardas Avatar asked Oct 12 '10 21:10

Eduardas


2 Answers

You will likely want to take a look at the built-in set operation functions in MATLAB. Why reinvent the wheel if you don't have to? ;)

HINT: The ISMEMBER function may be of particular interest to you.

EDIT:

Here's one way you can approach this problem using nested loops, but setting them up to try and reduce the number of potential iterations. First, we can use the suggestion in Marc's comment to sort the list of sets by their number of elements so that they are arranged largest to smallest:

setList = {[1 2 3 4],...  %# All your sets, stored in one cell array
           [4 3],...
           [3 4 1],...
           [4 3 2 1]};
nSets = numel(setList);                       %# Get the number of sets
setSizes = cellfun(@numel,setList);           %# Get the size of each set
[temp,sortIndex] = sort(setSizes,'descend');  %# Get the sort index
setList = setList(sortIndex);                 %# Sort the sets

Now we can set up our loops to start with the smallest sets at the end of the list and compare them first to the largest sets at the start of the list to increase the odds we will find a superset quickly (i.e. we're banking on larger sets being more likely to contain smaller sets). When a superset is found, we remove the subset from the list and break the inner loop:

for outerLoop = nSets:-1:2
  for innerLoop = 1:(outerLoop-1)
    if all(ismember(setList{outerLoop},setList{innerLoop}))
      setList(outerLoop) = [];
      break;
    end
  end
end

After running the above code, setList will have all sets removed from it that are either subsets or duplicates of other sets preceding them in the list.

In the best case scenario (e.g. the sample data in your question) the inner loop breaks after the first iteration every time, performing only nSets-1 set comparisons using ISMEMBER. In the worst case scenario the inner loop never breaks and it will perform (nSets-1)*nSets/2 set comparisons.

like image 112
gnovice Avatar answered Nov 08 '22 09:11

gnovice


I think the question you mean to ask is "given a list of sets, pick out the set that contains all of the others". There are a bunch of edge cases where I don't know what output you would want (e.g. A = { 1, 2 } and B = { 3, 4 }), so you need to clarify a lot.

However, to answer the question you did ask, about set containment, you can use set difference (equivalently complement wrt another set). In Mathematica, this sort of thing:

setA = {1, 2, 3, 4};
setB = {4, 3};
setC = {3, 4, 1};
setD = {4, 3, 2, 1};
Complement[setD, setA] == {}
 True

indicates setD is a subset of setA.

like image 43
Jonathan Avatar answered Nov 08 '22 08:11

Jonathan