Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Largest intersection of N sets with ability to ignore certain sets (Set compression)

Say you have N sets of unsorted characters and these sets have common characters between them. I want to factor out as many characters that I can from these sets to make them smaller. But there is a constraint on factoring characters out: the characters must be in the intersection of M sets that you choose from N. This is somewhat of a lossless set compression algorithm. The examples below are ordered sets, but this is for easy reading. Don't assume sets will be ordered.

A simple example:

S1 = a b c d
S2 = a b c e f
S3 = a f g

The answer is to intersect only S1 and S2 and factor out: a b c. This cuts out 6 characters where any other intersection combination of sets would take out less.

A tricky example:

S1 = a b c d e f g h i
S2 = j k l m n
S3 = j k l o p q
S4 = j k l
S5 = a b c d

The answer would be to ignore sets S1 and S5 and take the intersection of the remaining sets S2, S3 and S4 to get: j k l.

The reason why a b c d is not correct is because when you factor those characters out of the sets, 19 characters remain, where as when you factor j k and l out, only 18 characters remain.

Is there an algorithm to solve these kind of problems faster than exponential time? It seems like you would have to test the intersection of every set in the Power Set of the sets ({}, {S1}, {S2}, {S3}, {S1, S2}, {S1, S3}, {S2, S3}, {S1, S2, S3}) - 8 intersections to compute if there were only 3 sets.

P.S. This is not an urgent question, but I thought it was a interesting problem I came across.

like image 422
Kyle Paulsen Avatar asked Feb 11 '15 02:02

Kyle Paulsen


1 Answers

If the alphabet size is not too large...I would use Dynamic Programming to solve this problem...the running time should be O(S*2^n), S = # of sets, n = # of alphabets

Define DP(i, bitmask) be the maximum # of characters could be canceled for any subset within set-0 to set-i, using this bitmask

For example, we have 3 sets and 5 alphabets {a,b,c,d,e} now

S0 = {a,d,e}, S1 = {b,c,e}, S2 = {a,c,e}

Try to use 0-1 bit to mask each set:

S0 = 11001 = 25, S1 = 10110 = 22, S2 = 10101 = 21

There is total 2^5 different possible masks and we will loop through all of them when calculating DP(i, bitmask)

Now initialize with DP(0, x) (i.e. simply fill the # of 1-bit of x) and use following transition to fill DP(i,x) for i > 0:

DP(i, x) = DP(i-1,x) + { # of 1-bit of x if (Si & x == x); 0 otherwise} Si is the bitmask of the Set i, & is bitwise and operation

The answer is the maximum of DP(S-1, x) for all x

This approach can find all possible solutions if there are many of them, below are sample code in C++ solving the example above:

#include<cstdio>
#include<cstdlib>
#include<algorithm>
using namespace std;

int s[3] = {25,22,21};
int dp[5][1<<5] = {0};

int bits(int x){
    int cnt = 0;
    while(x){ cnt += (x&1); x>>=1;}
    return cnt;
}

int main() {
    for(int i=0; i< (1<<5); i++) if((s[0]&i) == i){ dp[0][i] = bits(i); }

    for(int i=1; i<3;i++)
        for(int j=0; j< (1<<5); j++){
            dp[i][j] = dp[i-1][j];
            if((s[i]&j) == j) {dp[i][j] = max(bits(j), dp[i-1][j]+ bits(j));      }
        }


    int x = -1;
    for(int i=0; i< (1<<5); i++){
        x = max(x, dp[2][i]);
        printf("Maximum cancelled: %d,  current DP: %d, bitmask: %d\n", x, dp[2][i], i);
    }
    return 0;
}

Whenever the output of the DP State is equal to the maximum number cancelled, its bitmask is the corresponding solution, you can easily convert back to English characters, i.e {c,e} or {a,e} in the above example

EDITED: To reply the comment below, I try to respond parts by parts here:

Q1. Is it still exponential? Only from exponential to the # of set transfer to the # of alphabets?

A1. Yes it is. I have this thought as I think in reality the alphabet size would not be too large...but in theory yes it is still exponential time

Q2. Is this problem NP Complete?

A2. Okay, this is the interesting part, here is my thought, if I am wrong please correct me, I think YES it is NP Complete. My thought is to model this problem in to a graph problem, see the image below (bare my poor mspaint skill) enter image description here

We got a bipartite graph, and in the same sense to your original problem, we now want to find the Maximum Complete Subgraph -- That's a Clique in general graph which is a well-known NP Complete problem.

And then I think, it is a Bipartite Graph! Maybe Clique in bipartite graph is not NP Complete, yet thanks to Google, I found another problem Complete Bipartite Graph and focus on the first property in the page:

Given a bipartite graph, testing whether it contains a complete bipartite subgraph Ki,i for a parameter i is an NP-complete problem.

To conclude, I think this is NP-Complete

Q3. How to come up with such DP solution?

A3. Combine with A1., many NPC Problems actually has a Pseudo-Polynomial Solution, and O(x * 2^y) is quite a common form as far as I know, an example would be Hamiltonian Path, which can be solved in O(n^2 * 2^n). As an extra bit, if you ask myself, I have similar idea of Knapsack Problem as well when thinking this DP solution...but that's a bit non-related to your question...

like image 130
shole Avatar answered Nov 08 '22 06:11

shole