Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most common subset of size k

Suppose you have a list of subsets S1,...,Sn of the integer range R={1,2,...,N}, and an integer k. Is there an efficient way to find a subset C of R of size k such that C is a subset of a maximal number of the Si?

As an example, let R={1,2,3,4} and k=2

S1={1,2,3}
S2={1,2,3}
S3={1,2,4}
S4={1,3,4}

Then I want to return either C={1,2} or C={1,3} (doesn't matter which).

like image 953
mitchus Avatar asked Feb 17 '12 12:02

mitchus


2 Answers

I think your problem is NP-Hard. Consider the bipartite graph with the left nodes being your sets and the right nodes being the integers {1, ..., N}, with an edge between two nodes if the set contains the integer. Then, finding a common subset of size k, which is a subset of a maximal number of the Si, is equivalent to finding a complete bipartite subgraph K(i, k) with maximal number of edges i*k. If you could do this in polynomial time, then, you could find the complete bipartite subgraph K(i, j) with maximal number of edges i*j in polynomial time, by trying for each fixed k. But this problem in NP-Complete (Complete bipartite graph).

So, unless P=NP, your problem does not have a polynomial time algorithm.

like image 152
Edouard Avatar answered Oct 07 '22 18:10

Edouard


Assuming I understand your question I believe this is straightforward for fairly small sets.

I will use Mathematica code for illustration, but the concept is universal.

I generate 10 random subsets of length 4, from the set {1 .. 8}:

ss = Subsets[Range@8, {4}] ~RandomSample~ 10
{{1, 3, 4, 6}, {2, 6, 7, 8}, {3, 5, 6, 7}, {2, 4, 6, 7}, {1, 4, 5, 8},
 {2, 4, 6, 8}, {1, 2, 3, 8}, {1, 6, 7, 8}, {1, 2, 4, 7}, {1, 2, 5, 7}}

I convert these to a binary array of the presence of each number in each subset:

a = Normal@SparseArray[Join @@ MapIndexed[Tuples[{##}] &, ss] -> 1];

Grid[a]

Mathematica graphics

That is ten columns for ten subsets, and eight rows for elements {1 .. 8}.

Now generate all possible target subsets (size 3):

keys = Subsets[Union @@ ss, {3}];

Take a "key" and extract those rows from the array and do a BitAnd operation (return 1 iff all columns equal 1), then count the number of ones. For example, for key {1, 6, 8} we have:

a[[{1, 6, 8}]]

Mathematica graphics

After BitAnd:

Mathematica graphics

Do this for each key:

counts = Tr[BitAnd @@ a[[#]]] & /@ keys;

Then find the position(s) of the maximum element of that list, and extract the corresponding parts of keys:

keys ~Extract~ Position[counts, Max@counts]
{{1, 2, 7}, {2, 4, 6}, {2, 4, 7}, {2, 6, 7}, {2, 6, 8}, {6, 7, 8}}

With adequate memory this process works quickly for a larger set. Starting with 50,000 randomly selected subsets of length 7 from {1 .. 30}:

ss = Subsets[Range@30, {7}] ~RandomSample~ 50000;

The maximum sub-subsets of length 4 are calculated in about nine seconds:

AbsoluteTiming[
  a = Normal@SparseArray[Join @@ MapIndexed[Tuples[{##}] &, ss] -> 1];
  keys = Subsets[Union @@ ss, {4}];
  counts = Tr[BitAnd @@ a[[#]]] & /@ keys;
  keys~Extract~Position[counts, Max@counts]
]
 {8.8205045, {{2, 3, 4, 20},
              {7, 10, 15, 18},
              {7, 13, 16, 26},
              {11, 21, 26, 28}}}

I should add that Mathematica is a high level language and these operations are on generic objects, therefore if this is done truly at the binary level this should be much faster, and more memory efficient.

like image 25
Mr.Wizard Avatar answered Oct 07 '22 19:10

Mr.Wizard