Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to find the list of attributes which would yield to the greatest sum without brute forcing?

Tags:

algorithm

I have about 2M records stored in a table. Each record has a number and about 5K boolean attributes.

So the table looks something like this.

3, T, F, T, F, T, T, ...
29, F, F, T, F, T, T, ...
...
-87, T, F, T, F, T, T, ...
98, F, F, T, F, F, T, ...

And I defined SUM(A, B) as the sum of the numbers where Ath and Bth attributes are true. For example, from the sample data above: SUM(1, 3) = 3 + ... + (-87) because the 1st and the 3rd attributes are T for 3 and -87

3, (T), F, (T), F, T, T, ...
29, (F), F, (T), F, T, T, ...
...
-87, (T), F, (T), F, T, T, ...
98, (F), F, (T), F, F, T, ...

And SUM() can take any number of parameters: SUM(1) and SUM(5, 7, ..., 3455) are all possible.

Are there some smart algorithms for finding a list of attributes L where SUM(L) would yields to the maximum result? Obviously, brute forcing is not feasible for this large data set.

It would be awesome if there is a way to find not only the maximum but top N lists.

EDIT It seems like it is not possible to find THE answer without brute forcing. If I changed the question to find a "good estimation", would there be a good way to do it? Or, what if I said the cardinality of L is fixed to something like 10, would there be a way to calculate the L? I would be happy with any.

like image 336
Bing Avatar asked Jul 07 '13 08:07

Bing


1 Answers

Unfortunately, this problem is NP-complete. Your options are limited to finding a good but non-maximal solution with an approximation algorithm, or using branch-and-bound and hoping that you don't hit exponential runtime.

Proof of NP-completeness

To prove that your problem is NP-complete, we reduce the set cover problem to your problem. Suppose we have a set U of N elements, and a set S of M subsets of U, where the union of all sets in S is U. The set cover problem asks for the smallest subset T of S such that every element of U is contained in an element of T. If we had a polynomial-time algorithm to solve your problem, we could solve the set cover problem as follows:

First, construct a table with M+N rows and M attributes. The first N rows are "element" rows, each corresponding to an element of U. These have value "negative enough"; -M-1 should be enough. For element row i, the jth attribute is true if the corresponding element is not in the jth set in S.

The last M rows are "set" rows, each corresponding to a set in S. These have value 1. For set row N+i, the ith attribute is false and all others are true.

The values of the element rows are small enough that any choice of attributes that excludes all element rows beats any choice of attributes that includes any element row. Since the union of all sets in S is U, picking all attributes excludes all element rows, so the best choice of attributes is the one that includes the most set rows without including any element rows. By the construction of the table, a choice of attributes will exclude all element rows if the union of the corresponding sets is U, and if it does, its score will be better the fewer attributes it includes. Thus, the best choice of attributes corresponds directly to a minimum cover of S.

If we had a good algorithm to pick a choice of attributes that produces the maximal sum, we could apply it to this table to generate the minimum cover of an arbitrary S. Thus, your problem is as hard as the NP-complete set cover problem, and you should not waste your time trying to come up with an efficient algorithm to generate the perfect choice of attributes.

like image 115
user2357112 supports Monica Avatar answered Oct 07 '22 01:10

user2357112 supports Monica