 

Minimising the sum of array columns in Matlab

I have a large array (approx 250,000 x 10). Each row contains 1s or -1s. E.g:

data(1, :) = [1, -1, -1, -1, -1, -1, -1, -1, 1, -1];

I need to select sets of n rows, such that the mean of the absolute sums of the columns is minimized (as close to zero as possible). So, in this toy example, where n=2:

[ 1  1  1  1]
[-1 -1 -1 -1]
[-1  1 -1  1]

I would select rows 1 and 2, as they sum to [0 0 0 0] (mean 0), which is the minimum possible when n=2.
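
For reference, the quantity being minimized is a one-liner in MATLAB, where subset is an n x 10 slice of data:

score = mean(abs(sum(subset, 1)));   % mean of the absolute column sums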


I tried the method suggested below (finding complementary pairs), but for my dataset this can only form a balanced subset of 23k rows. So I need an approximation that generates a subset of n rows with the minimum possible mean of absolute column sums.

The best approach I have found so far is as follows: pick a starting subset, iteratively add each row from the remainder to the base, and keep it if it improves the mean of the absolute sums of the columns. This is very crude and I am sure there are better ways. It is also prone to getting stuck at false minima, so a contingency needs to be added:

shuffle = randperm(size(data, 1));      % permute row indices, not numel(data)
data_shuffled = data(shuffle, :);

base = data_shuffled(1:30000, :);
pool = data_shuffled(30001:end, :);

best_sum = sum(base, 1);                % running column sums of the kept rows
best_mean = mean(abs(best_sum));
best_matrix = base;
n = 100000;

for k = 1:20

    for i = 1:size(pool, 1)
        temp = pool(i, :);

        if ~isnan(temp(1))              % NaN marks rows already consumed
            new_sum = best_sum + temp;
            temp_mean = mean(abs(new_sum));

            if temp_mean < best_mean
                best_mean = temp_mean;
                best_sum = new_sum;
                best_matrix = vertcat(best_matrix, temp);
                pool(i, :) = NaN(1, 10);
            end
        end
    end

    if size(best_matrix, 1) >= n
        break
    end

end

This achieves a mean of the absolute column sums of ~17,000, which is not too bad. Repeating with different random seeds would probably improve it a bit.

Ideally, rather than just adding the new element to the end of best_matrix, I would swap it with some element, so as to achieve the best improvement.
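
A minimal sketch of that swap idea (not from the original post; it reuses the best_matrix and pool variables above and tries random swaps rather than searching for the best one):

cur_sum = sum(best_matrix, 1);
for t = 1:200000
    a = randi(size(best_matrix, 1));      % kept row to consider evicting
    b = randi(size(pool, 1));             % candidate replacement from the pool
    if isnan(pool(b, 1)), continue, end   % skip rows already consumed
    new_sum = cur_sum - best_matrix(a, :) + pool(b, :);
    if mean(abs(new_sum)) < mean(abs(cur_sum))
        evicted = best_matrix(a, :);      % accept the improving swap
        best_matrix(a, :) = pool(b, :);
        pool(b, :) = evicted;
        cur_sum = new_sum;
    end
end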

Update: I don't want to give out specific details of the dataset because all solutions should be applicable to any matrices in the specified format.

Thank you to everyone that contributed!

asked Feb 22 '16 by Chris Parry



3 Answers

What about the following approach? With 10 columns that can only hold +1 and -1 values, there are only 2^10 = 1024 possible distinct rows. So our data reduces to:

  1. a 1024 x 10 matrix a(i,j) of -1/+1 coefficients, containing every possible (unique) row.
  2. a vector v(i) counting how many times we saw row i.

Now we can write a simple Mixed Integer Programming problem as follows:

The model (shown as an image in the original answer; reconstructed here from the notes below):

  min   sum_j ( p(j) + q(j) )
  s.t.  sum_i a(i,j)*x(i) = p(j) - q(j)   for each column j
        sum_i x(i) = N
        0 <= x(i) <= v(i),  x(i) integer
        p(j), q(j) >= 0

Notes:

  • We only have 1024 integer variables
  • We set an upper bound on x(i) that indicates how many times a row can be selected
  • We use a so-called variable splitting technique to model the absolute values and keep the model linear
  • Minimizing the mean is the same as minimizing the sum (the difference is a constant factor)
  • The optcr option (the relative optimality gap tolerance in GAMS) is set to zero, which tells the MIP solver to find proven globally optimal solutions
  • A good MIP solver should be able to find solutions very quickly. I tested with some random data using 250k rows and N=100. I actually believe this is an easy problem.
  • To reiterate: this method delivers proven global optimal solutions.
  • Some more details can be found here.
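
For what it's worth, here is a sketch of the same model in MATLAB using intlinprog (assumes the Optimization Toolbox; N and all variable names are illustrative):

[U, ~, ic] = unique(data, 'rows');      % U: up to 1024 unique +/-1 rows
v = accumarray(ic, 1);                  % v(i): how many times row i occurs
[m, c] = size(U);
N = 100000;                             % required subset size

% Decision vector z = [x; p; q]: x are integer row counts, p and q the
% positive/negative parts of the column sums (variable splitting).
f   = [zeros(m, 1); ones(2*c, 1)];      % minimize sum(p) + sum(q)
Aeq = [U', -eye(c), eye(c);             % U'*x - p + q = 0
       ones(1, m), zeros(1, 2*c)];      % sum(x) = N
beq = [zeros(c, 1); N];
lb  = zeros(m + 2*c, 1);
ub  = [v; inf(2*c, 1)];

z = intlinprog(f, 1:m, [], [], Aeq, beq, lb, ub);
x = round(z(1:m));                      % copies of each unique row to keep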
answered by Erwin Kalvelagen


As others have stated, an optimal solution might be impossible to find, so I'll focus on specific cases.

First, I assume the distributions of the columns are independent.

Then I work in accumulator space to reduce the size of the data and speed up the code.

I do this by treating each -1 as a 0, reading each row as a binary number, and adding 1 to avoid working with 0 as an index, such as:

data(1,:)=[-1 1 -1 1 -1 1 -1 1 -1 1] -> '0101010101' -> 341 -> 342

With this we can accumulate the data as:

function accum=mat2accum(data)
% Encode each row of +/-1's as a binary number (+1 so indexes start at 1)
% and count how many times each possible row occurs.

[~,n]=size(data);
indexes=bin2dec(num2str((data+1)/2))+1;   % row -> index in 1..2^n
accum=accumarray(indexes,1,[2^n 1]);      % accum(i): count of row i
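
Later code also calls accum2mat, the inverse mapping, which the answer does not list. A minimal sketch, assuming it simply expands each accumulator bin back into that many rows of +/-1's:

function data=accum2mat(accum)
% Hypothetical reconstruction: expand bin k into accum(k) identical rows.

N=size(accum,1);
n=log2(N);
idx=repelem((1:N)',accum);                % bin k repeated accum(k) times
data=(double(dec2bin(idx-1,n))-48)*2-1;   % indexes back to +/-1 rows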

The first case I consider is when the sum of each column is small compared to the number of rows, meaning there is a similar amount of 1's and -1's in every column:

abs(sum(data)) << size(data, 1)

For this case you can find all the pairs that cancel each other like:

data(1,:)=[-1 1 -1 1 -1 1 -1 1 -1 1] -> '0101010101' -> 341 -> 342
data(2,:)=[1 -1 1 -1 1 -1 1 -1 1 -1] -> '1010101010' -> 682 -> 683

And we know that each pair will be in the mirrored position in the accumulator index, so we can get all possible pairs with:

function [accumpairs, accumleft]=getpairs(accum)
% A row and its complement sit at mirrored accumulator indexes
% (i and 2^n+1-i), so pair them off bin by bin.

accumpairs=min([accum,accum(end:-1:1)],[],2);   % pairable count per bin
accumleft=accum-accumpairs;                     % unpaired remainder
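
A usage sketch with the names above:

accum=mat2accum(data);
[accumpairs,accumleft]=getpairs(accum);
n_balanced=sum(accumpairs);      % rows available in mutually cancelling pairs
balanced=accum2mat(accumpairs);  % subset whose column sums are all exactly zero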

With randomly generated data I was able to find >100k pairs in a set of 250k rows, and any subset built from such pairs has a column sum of exactly zero in every column. So if 1's and -1's are equally distributed, this might be enough.


The second case I considered is when the sum of each column is far from zero, meaning there is a big disproportion between 1's and -1's:

abs(sum(data)) >> 0

By inverting each column whose sum is negative (which loses nothing, since those columns can be inverted back at the end), it is possible to force the disproportion to be more 1's than -1's in every column. Extracting the possible pairs from this data first makes the disproportion even more pronounced.
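
A sketch of that normalization (the flipped columns are remembered so they can be inverted back at the end):

flip=sum(data,1)<0;           % columns dominated by -1's
data(:,flip)=-data(:,flip);   % now every column sums to >= 0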

With the data prepared like this, the problem can be treated as minimizing the number of 1's in the required set. To do this, we first randomize the possible indexes, then sort them by the Hamming weight (the number of 1's in the binary representation) of each index, and then gather rows with the smallest Hamming weight first.

function [accumlast,accumleft]=resto(accum,m)
% Gather m rows, taking accumulator bins in order of the Hamming weight
% of their index (fewest 1's first); randperm breaks ties randomly.

[N,~]=size(accum);
columns=log2(N);
indexes=randperm(N)';
[~,I]=sort(sum((double(dec2bin(indexes-1,columns))-48),2));
accumlast=zeros(N,1);

for k=indexes(I)'
    accumlast(k)=accum(k);
    if sum(accumlast)>=m
        break
    end
end

accumleft=accum-accumlast;

With randomly generated data containing around twice as many 1's as -1's, where the sum of each column was around 80k, I can find a subset of 100k rows with a sum of around 5k in each column.
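
Putting the second case together, a sketch of the whole pipeline (assumes the functions above, a target subset size n, and fewer than n pairable rows):

accum=mat2accum(data);
[accumpairs,accumleft]=getpairs(accum);   % zero-sum pairs first
need=n-sum(accumpairs);                   % rows still missing after pairing
[accumlast,~]=resto(accumleft,need);      % fill with low-Hamming-weight rows
subset=[accum2mat(accumpairs);accum2mat(accumlast)];
score=mean(abs(sum(subset,1)));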


The third case is when some column sums are close to zero and some are not. In this case you separate the columns into those with a big sum and those with a small sum, sort the data by the Hamming weight of the big-sum columns, and pair off the small-sum columns within each big-column index. This produces, for each index of the big-sum columns, the number of possible pairs, the number of unpairable rows, and the sum of the unpairable rows over the small columns.

Now you can use that information to keep a running sum and decide which indexes of the big-sum columns to add to your subset, and whether it is worth adding the pairable or the unpairable data in each case.

function [accumout,accumleft]=getseparated(accum, bigcol, smallcol, m)

data=accum2mat(accum);

% indexing
bigindex=bin2dec(num2str((data(:,bigcol)+1)/2))+1;   % index by the big-sum columns
[~,bn]=size(bigcol);
[~,sn]=size(smallcol);

% Hamming weight
b_ind=randperm(2^bn)';
[~,I]=sort(sum((double(dec2bin(b_ind-1,bn))-48),2));

temp=zeros(2^bn,4+sn);

w=waitbar(0,'Processing');
for k=1:2^bn
    small_data=data(bigindex==b_ind(I(k)),smallcol);
    if ~isempty(small_data)
        small_accum=mat2accum(small_data);
        [small_accumpairs, small_accum]=getpairs(small_accum);
        n_pairs=sum(small_accumpairs);
        n_non_pairs=sum(small_accum);
        sum_non_pairs=sum(accum2mat(small_accum));
    else
        n_pairs=0;
        n_non_pairs=0;
        sum_non_pairs=zeros(1,sn);
    end
    ham_weight=sum((double(dec2bin(b_ind(I(k))-1,bn))-48),2);
    temp(k,:)=[b_ind(I(k)) n_pairs n_non_pairs ham_weight sum_non_pairs];
    waitbar(k/2^bn);
end

close(w)

pair_ind=1;
nonpair_ind=1;
runningsum=zeros(1,bn+sn);   % one running total per column (10 here)
temp2=zeros(2^bn,2);

while sum(sum(temp2))<=m
     if pair_ind<=2^bn
         pairsum=[(((double(dec2bin((temp(pair_ind,1)-1),bn))-48)*2)-1)*temp(pair_ind,2) zeros(1,sn)];
     end
     if nonpair_ind<=2^bn
         nonpairsum=[(((double(dec2bin((temp(nonpair_ind,1)-1),bn))-48)*2)-1)*temp(nonpair_ind,3) temp(nonpair_ind,5:5+sn-1)];
     end
     if nonpair_ind==(2^bn)+1
         temp2(pair_ind,1)=temp(pair_ind,2);
         runningsum=runningsum+pairsum;
         pair_ind=pair_ind+1;
     elseif pair_ind==(2^bn)+1
         temp2(nonpair_ind,2)=temp(nonpair_ind,3);
         runningsum=runningsum+nonpairsum;
         nonpair_ind=nonpair_ind+1;
     elseif sum(abs(runningsum+pairsum))<=sum(abs(runningsum+nonpairsum))
         temp2(pair_ind,1)=temp(pair_ind,2);
         runningsum=runningsum+pairsum;
         pair_ind=pair_ind+1;
     elseif sum(abs(runningsum+pairsum))>sum(abs(runningsum+nonpairsum))
         temp2(nonpair_ind,2)=temp(nonpair_ind,3);
         runningsum=runningsum+nonpairsum;
         nonpair_ind=nonpair_ind+1;
     end
end

accumout=zeros(2^(bn+sn),1);

for k=1:2^bn
    if any(temp2(k,:))
        small_data=data(bigindex==temp(k,1),smallcol);
        if ~isempty(small_data)
            small_accum=mat2accum(small_data);
            [small_accumpairs, small_accum]=getpairs(small_accum);
            pairs=accum2mat(small_accumpairs);
            non_pairs=accum2mat(small_accum);
        else
            pairs=zeros(1,sn);
            non_pairs=zeros(1,sn);
        end
        if temp2(k,1)
            datatemp=zeros(temp2(k,1),sn+bn);
            datatemp(:,bigcol)=((double(dec2bin(ones(temp2(k,1),1)*(temp(k,1)-1),bn))-48)*2)-1;
            datatemp(:,smallcol)=pairs;
            accumout=accumout+mat2accum(datatemp);
        end
        if temp2(k,2)
            datatemp=zeros(temp2(k,2),sn+bn);
            datatemp(:,bigcol)=((double(dec2bin(ones(temp2(k,2),1)*(temp(k,1)-1),bn))-48)*2)-1;
            datatemp(:,smallcol)=non_pairs;
            accumout=accumout+mat2accum(datatemp);
        end
    end
end

accumleft=accum-accumout;
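
A usage sketch (the 5/5 column split is illustrative, matching the test described below):

accum=mat2accum(data);
[accumout,accumleft]=getseparated(accum,6:10,1:5,100000);   % bigcol, smallcol, m
subset=accum2mat(accumout);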

With data composed of 5 columns of the first case and 5 columns of the second case, it was possible to construct a set of 100k rows with <1k of sum in the small-sum columns and between 10k and 30k in the big ones.

It's worth noting that the size of the data, the size of the required subset, and the distribution of 1's and -1's all have a great effect on the performance of these algorithms.

answered by Noel Segura Meraz


This problem, sadly, falls outside the scope of regular (continuous) optimization. Your problem, which can be parameterized as follows:

min_{S ∈ S_n} Σ_j |Σ_{i∈S} data_ij|

where S_n is the set of n-element subsets of the row indices i ∈ {1,...,250000} and j runs over the 10 columns, can also be rewritten as a very similar regular Quadratic Integer Programming problem in x:

min_x  x' * (data * data') * x
s.t.   x ∈ {0,1}^250000,  1' * x = n

where data is your 250000 x 10 matrix and x is the 250000 x 1 binary indicator vector of the combination we're looking for. (Now we optimize the sum of squares instead of the sum of absolute values... )
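
To see why this quadratic form is the right objective, note that for a 0/1 selection vector x, x' * (data * data') * x = ||data' * x||^2, i.e. the sum of squared column sums of the selected rows. In MATLAB (rows_selected being hypothetical subset indices):

x = zeros(size(data, 1), 1);
x(rows_selected) = 1;
obj = sum((data' * x).^2);   % equals x' * (data * data') * x without the 250k x 250k product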

This problem is proven to be NP-hard, which means that in the worst case you essentially have to go through all possible combinations of n draws out of 250,000, of which there are C(250000, n) = 250000!/(n!*(250000-n)!)...

So good luck... ;)

EDIT

If you're going to solve this heuristically, since I suppose you're going to need a solution, use the heuristics here instead of your approach.

answered by reverse_engineer