I have a large array (approx 250,000 x 10). Each row contains 1s or -1s. E.g:
data(1, :) = [1, -1, -1, -1, -1, -1, -1, -1, 1, -1];
I need to select sets of n rows, such that the mean of the absolute sums of the columns is minimized (as close to zero as possible). So, in this toy example, where n=2:
[ 1 1 1 1]
[-1 -1 -1 -1]
[-1 1 -1 1]
I would select rows 1 and 2, as they sum to [0 0 0 0] (mean 0), which is the minimum possible when n=2.
I tried the method suggested below (of finding complementary pairs), but for my dataset, this can only form a balanced subset of 23k rows. So, I need an approximation which generates a subset of size n rows, but with the minimum means of absolute sums of the columns.
The best approach I have found so far is as follows: pick a starting subset, iteratively add each row from the remainder to the base, and keep it if it improves the mean of the absolute sums of the columns. This is very crude and I am sure there are better ways. It is also prone to getting stuck at false minima, so a contingency needs to be added:
shuffle = randperm(size(data, 1));            % random permutation of the row indices
data_shuffled = data(shuffle, :);
base = data_shuffled(1:30000, :);             % starting subset
pool = data_shuffled(30001:end, :);           % remaining rows to draw from
best_mean = mean(abs(sum(base, 1)));
best_matrix = base;
n = 100000;                                   % target subset size
for k = 1:20
    for i = 1:size(pool, 1)
        temp = pool(i, :);
        if ~isnan(temp(1))                    % skip rows already moved into the subset
            new_sum = temp + sum(best_matrix, 1);
            temp_mean = mean(abs(new_sum));
            if temp_mean < best_mean          % keep the row only if it improves the objective
                best_mean = temp_mean;
                best_matrix = vertcat(best_matrix, temp);
                pool(i, :) = NaN(1, 10);      % mark this row as used
            end
        end
    end
    if size(best_matrix, 1) >= n
        return
    end
end
This achieves a mean of the absolute sums of the columns of ~17,000, which is not too bad. Repeating with different seeds will probably improve it a bit.
Ideally, rather than just adding the new element to the end of best_matrix, I would swap it with some element, so as to achieve the best improvement.
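A rough sketch of one such swap pass (assuming implicit array expansion, R2016b or later, and a pool with no NaN markers) might be:
base_sum = sum(best_matrix, 1);
best_obj = mean(abs(base_sum));
for i = 1:size(best_matrix, 1)
    reduced = base_sum - best_matrix(i, :);      % column sums with row i removed
    cand_obj = mean(abs(reduced + pool), 2);     % objective after swapping in each pool row
    [obj, j] = min(cand_obj);
    if obj < best_obj                            % keep the best improving swap for this slot
        swapped_out = best_matrix(i, :);
        best_matrix(i, :) = pool(j, :);
        pool(j, :) = swapped_out;
        base_sum = reduced + best_matrix(i, :);
        best_obj = obj;
    end
end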
Update: I don't want to give out specific details of the dataset because all solutions should be applicable to any matrices in the specified format.
Thank you to everyone that contributed!
What about the following approach. With 10 columns only holding +1 and -1 values, there are only 1024 different rows possible. So our data is now:
a(i,j) : a matrix with -1 and +1 coefficients whose rows are all the different possible (unique) rows
v(i) : how many times we saw row i
Now we can write a simple Mixed Integer Programming problem as follows:
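The model itself doesn't appear above, so here is a hedged reconstruction (my own formulation, not necessarily the one originally intended): let x(i) be an integer count of how many copies of unique row i we select, and let y(j) bound the absolute column sum of column j:
min   sum_j y(j)
s.t.  -y(j) <= sum_i a(i,j)*x(i) <= y(j)   for every column j
      sum_i x(i) = n
      0 <= x(i) <= v(i),  x(i) integer
With at most 1024 integer variables and 10 continuous ones, a MILP of this size should be well within reach of an off-the-shelf solver (e.g. intlinprog).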
As others have stated, an optimal solution might be impossible, so I'll focus on specific cases.
First I assume independence of the distributions of each column.
Then I work in accumulator space to reduce the size of the data and speed up the code.
I do this by treating each -1 as a 0, reading each row as a binary number, and adding 1 to avoid working with 0 as an index, e.g.:
data(1,:)=[-1 1 -1 1 -1 1 -1 1 -1 1] -> '0101010101' -> 341 -> 342
With this we can accumulate the data as:
function accum=mat2accum(data)
% Map each +/-1 row to an integer index in 1..2^n and count how often it occurs.
[~,n]=size(data);
indexes=bin2dec(num2str((data+1)/2))+1;   % row -> '0'/'1' string -> decimal, shifted by 1
accum=accumarray(indexes,1,[2^n 1]);      % accum(k) = number of rows with index k
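The later code also calls an accum2mat helper that is never listed; a minimal sketch, assuming it is simply the inverse of mat2accum (expand each counted bit pattern back into +/-1 rows), could be:
function data=accum2mat(accum)
% Inverse of mat2accum: expand the accumulator back into +/-1 rows,
% repeating each bit pattern according to its count.
n=log2(length(accum));
idx=find(accum);                              % patterns that occur at least once
rows=(double(dec2bin(idx-1,n))-48)*2-1;       % decode index -> +/-1 row
data=repelem(rows,accum(idx),1);              % repeat each row by its count (R2015a+)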
First case I consider is when the sum of each column is a small number compared to the size of the data, meaning that there is a similar number of 1's and -1's in every column.
abs(sum(data)) << size(data, 1)
For this case you can find all the pairs that cancel each other like:
data(1,:)=[-1 1 -1 1 -1 1 -1 1 -1 1] -> '0101010101' -> 341 -> 342
data(2,:)=[1 -1 1 -1 1 -1 1 -1 1 -1] -> '1010101010' -> 682 -> 683
And we know that each pair will be in the mirrored position in the accumulator index, so we can get all possible pairs with:
function [accumpairs, accumleft]=getpairs(accum)
% A row and its complement sit at mirrored accumulator positions,
% so the number of pairable copies at each index is the minimum of the two counts.
accumpairs=min([accum,accum(end:-1:1)],[],2);
accumleft=accum-accumpairs;
With randomly generated data I was able to get >100k pairs in a set of 250k rows, and any subset built from such pairs will have a sum equal to zero in each column. So if 1's and -1's are equally distributed, this might be enough.
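For example, the zero-sum subset itself can be recovered by expanding the paired counts (using the accum2mat sketch above):
accum = mat2accum(data);
[accumpairs, ~] = getpairs(accum);
balanced = accum2mat(accumpairs);   % every column of this subset sums to exactly zero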
Second case I considered was when the sum of each column is far from zero, meaning that there is a big disproportion between 1's and -1's.
abs(sum(data)) >> 0
By inverting each column whose sum is negative (which does not affect the problem, since those columns can be inverted back at the end), it is possible to force the disproportion towards more 1's than -1's. And by first extracting the possible pairs from this data, the disproportion becomes even more pronounced.
With the data prepared like this, the problem can be treated as minimizing the number of 1's in the required set. For this we first randomize the possible indexes, then compute and sort the Hamming weight (number of 1's in the binary representation) of each index, and then gather the data with the smallest Hamming weights possible.
function [accumlast,accumleft]=resto(accum,m)
% Gather rows in order of increasing Hamming weight until m rows are collected.
[N,~]=size(accum);
columns=log2(N);
indexes=randperm(N)';                                        % randomize tie-breaking between equal weights
[~,I]=sort(sum(double(dec2bin(indexes-1,columns))-48,2));    % sort indexes by Hamming weight
accumlast=zeros(N,1);
for k=indexes(I)'
    accumlast(k)=accum(k);
    if sum(accumlast)>=m
        break
    end
end
accumleft=accum-accumlast;
With randomly generated data where there were around 2 times more 1's than -1's, and the sum of each column was around 80k, I can find a subset of 100k data with sum of around 5k in each column.
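Putting the second case together might look roughly like this (my own sketch of the steps described above, assuming implicit expansion, R2016b or later, and that the pairs alone do not already reach the target size):
flip = sign(sum(data));  flip(flip==0) = 1;     % columns to invert so every column sum is >= 0
data_pos = data .* flip;                        % invert the negative-sum columns
accum = mat2accum(data_pos);
[accumpairs, accumleft] = getpairs(accum);      % harvest cancelling pairs first
m = 100000 - sum(accumpairs);                   % rows still needed after the pairs
[accumlast, ~] = resto(accumleft, m);
subset = [accum2mat(accumpairs); accum2mat(accumlast)] .* flip;   % undo the column inversion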
Third case is when some column sums are close to zero and some aren't. In this case you separate the columns into the ones with a big sum and the ones with a small sum, then sort the data by the Hamming weight of the big-sum columns, and find the pairs of the small-sum columns within each index of the big-sum columns. This creates a matrix with the number of possible pairs, the number of unpairable rows, and the sum of the unpairable rows over the small-sum columns, for each index of the big-sum columns.
Now you can use that information to keep a running sum and decide which indexes of the big-sum columns to add to your subset, and also whether it is worth adding the pairable or the unpairable data in each case.
function [accumout,accumleft]=getseparated(accum, bigcol, smallcol, m)
data=accum2mat(accum);
disp('indexing')
bigindex=bin2dec(num2str((data(:,bigcol)+1)/2))+1;
[~,bn]=size(bigcol);
[~,sn]=size(smallcol);
disp('Hamming weight')
b_ind=randperm(2^bn)';
[~,I]=sort(sum(double(dec2bin(b_ind-1,bn))-48,2));   % sort big-column indexes by Hamming weight
temp=zeros(2^bn,4+sn);
w=waitbar(0,'Processing');
for k=1:2^bn
small_data=data(bigindex==b_ind(I(k)),smallcol);   % small columns of rows matching this big-column index
if ~isempty(small_data)
small_accum=mat2accum(small_data);
[small_accumpairs, small_accum]=getpairs(small_accum);
n_pairs=sum(small_accumpairs);
n_non_pairs=sum(small_accum);
sum_non_pairs=sum(accum2mat(small_accum));
else
n_pairs=0;
n_non_pairs=0;
sum_non_pairs=zeros(1,sn);
end
ham_weight=sum((double(dec2bin(b_ind(I(k))-1,bn))-48),2);
temp(k,:)=[b_ind(I(k)) n_pairs n_non_pairs ham_weight sum_non_pairs];
waitbar(k/2^bn);
end
close(w)
pair_ind=1;
nonpair_ind=1;
runningsum=zeros(1,bn+sn);   % running column sums of the subset built so far
temp2=zeros(2^bn,2);
while sum(sum(temp2))<=m
if pair_ind<=2^bn
pairsum=[(((double(dec2bin((temp(pair_ind,1)-1),bn))-48)*2)-1)*temp(pair_ind,2) zeros(1,sn)];
end
if nonpair_ind<=2^bn
nonpairsum=[(((double(dec2bin((temp(nonpair_ind,1)-1),bn))-48)*2)-1)*temp(nonpair_ind,3) temp(nonpair_ind,5:5+sn-1)];
end
if nonpair_ind==(2^bn)+1
temp2(pair_ind,1)=temp(pair_ind,2);
runningsum=runningsum+pairsum;
pair_ind=pair_ind+1;
elseif pair_ind==(2^bn)+1
temp2(nonpair_ind,2)=temp(nonpair_ind,3);
runningsum=runningsum+nonpairsum;
nonpair_ind=nonpair_ind+1;
elseif sum(abs(runningsum+pairsum))<=sum(abs(runningsum+nonpairsum))
temp2(pair_ind,1)=temp(pair_ind,2);
runningsum=runningsum+pairsum;
pair_ind=pair_ind+1;
elseif sum(abs(runningsum+pairsum))>sum(abs(runningsum+nonpairsum))
temp2(nonpair_ind,2)=temp(nonpair_ind,3);
runningsum=runningsum+nonpairsum;
nonpair_ind=nonpair_ind+1;
end
end
accumout=zeros(2^(bn+sn),1);
for k=1:2^bn
if any(temp2(k,:))
small_data=data(bigindex==temp(k,1),smallcol);
if ~isempty(small_data)
small_accum=mat2accum(small_data);
[small_accumpairs, small_accum]=getpairs(small_accum);
pairs=accum2mat(small_accumpairs);
non_pairs=accum2mat(small_accum);
else
pairs=zeros(1,sn);
non_pairs=zeros(1,sn);
end
if temp2(k,1)
datatemp=zeros(temp2(k,1),sn+bn);
datatemp(:,bigcol)=((double(dec2bin(ones(temp2(k,1),1)*(temp(k,1)-1),bn))-48)*2)-1;
datatemp(:,smallcol)=pairs;
accumout=accumout+mat2accum(datatemp);
end
if temp2(k,2)
datatemp=zeros(temp2(k,2),sn+bn);
datatemp(:,bigcol)=((double(dec2bin(ones(temp2(k,2),1)*(temp(k,1)-1),bn))-48)*2)-1;
datatemp(:,smallcol)=non_pairs;
accumout=accumout+mat2accum(datatemp);
end
end
end
accumleft=accum-accumout;
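A hypothetical call for the mixed case (the column split below is purely illustrative):
accum = mat2accum(data);
bigcol = 1:5;                         % columns assumed to have large absolute sums
smallcol = 6:10;                      % columns assumed to have sums close to zero
[accumout, accumleft] = getseparated(accum, bigcol, smallcol, 100000);
subset = accum2mat(accumout);         % the selected ~100k rows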
With data composed of 5 columns of the first case and 5 columns of the second case, it was possible to construct a set of 100k rows with <1k of sum in the small-sum columns and between 10k and 30k in the big ones.
It's worth noting that the size of the data, the size of the required subset, and the distribution of 1's and -1's will have a great effect on the performance of these algorithms.
This problem, sadly, falls outside the scope of regular (continuous) optimization. Your problem, which can be parameterized as follows:
min_{S∈S_n} Σ_{i=1..10} |Σ_{j∈S} data_{j,i}|
where S_n is the set of n-element combinations of row indices j∈{1,...,250000}, can also be rewritten as a very similar regular Quadratic Integer Programming problem in x:
min_x x'*data*data'*x
s.t. x∈{0,1}^250000 and 1'*x=n
where data is your 250000x10 matrix and x is the 250000x1 selection vector we are looking for. (Now we optimize the sum of squares instead of the sum of absolute values...)
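As a quick sanity check of this reformulation (my own illustration; selected_rows is a hypothetical index set of n chosen rows), the quadratic objective is just the squared 2-norm of the subset's column sums:
x = zeros(size(data, 1), 1);
x(selected_rows) = 1;            % 0/1 selection vector
colsums = data' * x;             % 10x1 vector of column sums over the selected rows
objective = colsums' * colsums;  % equals x' * (data * data') * x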
This problem is proven to be NP-hard, which means that to find the global minimizer you essentially have to go through all possible combinations of n draws out of 250000 possibilities, i.e. the binomial coefficient C(250000, n) = 250000!/(n!*(250000-n)!)
...
So good luck... ;)
EDIT
If you're going to solve this heuristically, since I suppose you're going to need a solution, use the heuristics here instead of your approach.