I have a large array (approx 250,000 x 10). Each row contains 1s or -1s. E.g:
data(1, :) = [1, -1, -1, -1, -1, -1, -1, -1, 1, -1];
I need to select sets of n rows, such that the mean of the absolute sums of the columns is minimized (as close to zero as possible). So, in this toy example, where n=2:
[ 1 1 1 1]
[-1 -1 -1 -1]
[-1 1 -1 1]
I would select rows 1 and 2, as they sum to [0 0 0 0] (mean 0), which is the minimum possible when n=2.
I tried the method suggested below (of finding complementary pairs), but for my dataset, this can only form a balanced subset of 23k rows. So, I need an approximation which generates a subset of size n rows, but with the minimum means of absolute sums of the columns.
The best approach I have found so far is as follows: pick a starting subset, iteratively add each row from the remainder to the base, and keep it if it improves the mean of the absolute sums of the columns. This is very crude and I am sure there are better ways. It is also prone to getting stuck at false minima, so a contingency needs to be added:
shuffle = randperm(size(data, 1));            % random permutation of the row indices
data_shuffled = data(shuffle, :);
base = data_shuffled(1:30000, :);             % starting subset
pool = data_shuffled(30001:end, :);           % remaining rows to draw from
best_mean = mean(abs(sum(base, 1)));
best_matrix = base;
n = 100000;                                   % target subset size
for k = 1:20
    for i = 1:size(pool, 1)
        temp = pool(i, :);
        if ~isnan(temp(1))                    % skip rows already moved into the subset
            new_sum = temp + sum(best_matrix, 1);
            temp_mean = mean(abs(new_sum));
            if temp_mean < best_mean          % keep the row only if it improves the objective
                best_mean = temp_mean;
                best_matrix = vertcat(best_matrix, temp);
                pool(i, :) = NaN(1, 10);      % mark this row as used
            end
        end
    end
    if size(best_matrix, 1) >= n
        return
    end
end
This achieves a mean of the absolute sums of the columns of ~17,000, which is not too bad. Repeating with different seeds will probably improve it a bit.
Ideally, rather than just adding the new element to the end of best_matrix, I would swap it with some element, so as to achieve the best improvement.
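A rough sketch of one such swap pass (assuming implicit array expansion, R2016b or later, and a pool with no NaN markers) might be:
base_sum = sum(best_matrix, 1);
best_obj = mean(abs(base_sum));
for i = 1:size(best_matrix, 1)
    reduced = base_sum - best_matrix(i, :);      % column sums with row i removed
    cand_obj = mean(abs(reduced + pool), 2);     % objective after swapping in each pool row
    [obj, j] = min(cand_obj);
    if obj < best_obj                            % keep the best improving swap for this slot
        swapped_out = best_matrix(i, :);
        best_matrix(i, :) = pool(j, :);
        pool(j, :) = swapped_out;
        base_sum = reduced + best_matrix(i, :);
        best_obj = obj;
    end
end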
Update: I don't want to give out specific details of the dataset because all solutions should be applicable to any matrices in the specified format.
Thank you to everyone that contributed!
What about the following approach. With 10 columns only holding +1 and -1 values, there are only 1024 different rows possible. So our data is now:
a(i,j) : a matrix with -1 and +1 coefficients whose rows are all the different possible (unique) rows
v(i) : how many times we saw row i
Now we can write a simple Mixed Integer Programming problem as follows:
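The model itself doesn't appear above, so here is a hedged reconstruction (my own formulation, not necessarily the one originally intended): let x(i) be an integer count of how many copies of unique row i we select, and let y(j) bound the absolute column sum of column j:
min   sum_j y(j)
s.t.  -y(j) <= sum_i a(i,j)*x(i) <= y(j)   for every column j
      sum_i x(i) = n
      0 <= x(i) <= v(i),  x(i) integer
With at most 1024 integer variables and 10 continuous ones, a MILP of this size should be well within reach of an off-the-shelf solver (e.g. intlinprog).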
As others have stated, an optimal solution might be impossible, so I'll focus on specific cases.
First I assume independence of the distributions of each column.
Then I work in accumulator space to reduce the size of the data and speed up the code.
I do this by treating each -1 as a 0, reading each row as a binary number, and adding 1 to avoid working with 0 as an index, e.g.:
data(1,:)=[-1 1 -1 1 -1 1 -1 1 -1 1] -> '0101010101' -> 341 -> 342
With this we can accumulate the data as:
function accum=mat2accum(data)
% Map each +/-1 row to an integer index in 1..2^n and count how often it occurs.
[~,n]=size(data);
indexes=bin2dec(num2str((data+1)/2))+1;   % row -> '0'/'1' string -> decimal, shifted by 1
accum=accumarray(indexes,1,[2^n 1]);      % accum(k) = number of rows with index k
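The later code also calls an accum2mat helper that is never listed; a minimal sketch, assuming it is simply the inverse of mat2accum (expand each counted bit pattern back into +/-1 rows), could be:
function data=accum2mat(accum)
% Inverse of mat2accum: expand the accumulator back into +/-1 rows,
% repeating each bit pattern according to its count.
n=log2(length(accum));
idx=find(accum);                              % patterns that occur at least once
rows=(double(dec2bin(idx-1,n))-48)*2-1;       % decode index -> +/-1 row
data=repelem(rows,accum(idx),1);              % repeat each row by its count (R2015a+)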
First case I consider is when the sum of each column is a small number compared to the size of the data, meaning that there is a similar number of 1's and -1's in every column.
abs(sum(data)) << size(data, 1)
For this case you can find all the pairs that cancel each other like:
data(1,:)=[-1 1 -1 1 -1 1 -1 1 -1 1] -> '0101010101' -> 341 -> 342
data(2,:)=[1 -1 1 -1 1 -1 1 -1 1 -1] -> '1010101010' -> 682 -> 683
And we know that each pair will be in the mirrored position in the accumulator index, so we can get all possible pairs with:
function [accumpairs, accumleft]=getpairs(accum)
% A row and its complement sit at mirrored accumulator positions,
% so the number of pairable copies at each index is the minimum of the two counts.
accumpairs=min([accum,accum(end:-1:1)],[],2);
accumleft=accum-accumpairs;
With randomly generated data I was able to get >100k pairs in a set of 250k rows, and any subset built from such pairs will have a sum equal to zero in each column. So if 1's and -1's are equally distributed, this might be enough.
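For example, the zero-sum subset itself can be recovered by expanding the paired counts (using the accum2mat sketch above):
accum = mat2accum(data);
[accumpairs, ~] = getpairs(accum);
balanced = accum2mat(accumpairs);   % every column of this subset sums to exactly zero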
Second case I considered was when the sum of each column is far from zero, meaning that there is a big disproportion between 1's and -1's.
abs(sum(data)) >> 0
By inverting each column whose sum is negative (which does not affect the problem, since those columns can be inverted back at the end), it is possible to force the disproportion towards more 1's than -1's. And by first extracting the possible pairs from this data, the disproportion becomes even more pronounced.
With the data prepared like this, the problem can be treated as minimizing the number of 1's in the required set. For this we first randomize the possible indexes, then compute and sort the Hamming weight (number of 1's in the binary representation) of each index, and then gather the data with the smallest Hamming weights possible.
function [accumlast,accumleft]=resto(accum,m)
% Gather rows in order of increasing Hamming weight until m rows are collected.
[N,~]=size(accum);
columns=log2(N);
indexes=randperm(N)';                                        % randomize tie-breaking between equal weights
[~,I]=sort(sum(double(dec2bin(indexes-1,columns))-48,2));    % sort indexes by Hamming weight
accumlast=zeros(N,1);
for k=indexes(I)'
    accumlast(k)=accum(k);
    if sum(accumlast)>=m
        break
    end
end
accumleft=accum-accumlast;
With randomly generated data where there were around 2 times more 1's than -1's, and the sum of each column was around 80k, I can find a subset of 100k data with sum of around 5k in each column.
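Putting the second case together might look roughly like this (my own sketch of the steps described above, assuming implicit expansion, R2016b or later, and that the pairs alone do not already reach the target size):
flip = sign(sum(data));  flip(flip==0) = 1;     % columns to invert so every column sum is >= 0
data_pos = data .* flip;                        % invert the negative-sum columns
accum = mat2accum(data_pos);
[accumpairs, accumleft] = getpairs(accum);      % harvest cancelling pairs first
m = 100000 - sum(accumpairs);                   % rows still needed after the pairs
[accumlast, ~] = resto(accumleft, m);
subset = [accum2mat(accumpairs); accum2mat(accumlast)] .* flip;   % undo the column inversion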
Third case is when some column sums are close to zero and some aren't. In this case you separate the columns into the ones with a big sum and the ones with a small sum, then sort the data by the Hamming weight of the big-sum columns, and find the pairs of the small-sum columns within each index of the big-sum columns. This creates a matrix with the number of possible pairs, the number of unpairable rows, and the sum of the unpairable rows over the small-sum columns, for each index of the big-sum columns.
Now you can use that information to keep a running sum and decide which indexes of the big-sum columns to add to your subset, and also whether it is worth adding the pairable or the unpairable data in each case.
function [accumout,accumleft]=getseparated(accum, bigcol, smallcol, m)
data=accum2mat(accum);
disp('indexing')
bigindex=bin2dec(num2str((data(:,bigcol)+1)/2))+1;
[~,bn]=size(bigcol);
[~,sn]=size(smallcol);
disp('Hamming weight')
b_ind=randperm(2^bn)';
[~,I]=sort(sum(double(dec2bin(b_ind-1,bn))-48,2));   % sort big-column indexes by Hamming weight
temp=zeros(2^bn,4+sn);
w=waitbar(0,'Processing');
for k=1:2^bn
small_data=data(bigindex==b_ind(I(k)),smallcol);   % small columns of rows matching this big-column index
if ~isempty(small_data)
small_accum=mat2accum(small_data);
[small_accumpairs, small_accum]=getpairs(small_accum);
n_pairs=sum(small_accumpairs);
n_non_pairs=sum(small_accum);
sum_non_pairs=sum(accum2mat(small_accum));
else
n_pairs=0;
n_non_pairs=0;
sum_non_pairs=zeros(1,sn);
end
ham_weight=sum((double(dec2bin(b_ind(I(k))-1,bn))-48),2);
temp(k,:)=[b_ind(I(k)) n_pairs n_non_pairs ham_weight sum_non_pairs];
waitbar(k/2^bn);
end
close(w)
pair_ind=1;
nonpair_ind=1;
runningsum=zeros(1,bn+sn);   % running column sums of the subset built so far
temp2=zeros(2^bn,2);
while sum(sum(temp2))<=m
if pair_ind<=2^bn
pairsum=[(((double(dec2bin((temp(pair_ind,1)-1),bn))-48)*2)-1)*temp(pair_ind,2) zeros(1,sn)];
end
if nonpair_ind<=2^bn
nonpairsum=[(((double(dec2bin((temp(nonpair_ind,1)-1),bn))-48)*2)-1)*temp(nonpair_ind,3) temp(nonpair_ind,5:5+sn-1)];
end
if nonpair_ind==(2^bn)+1
temp2(pair_ind,1)=temp(pair_ind,2);
runningsum=runningsum+pairsum;
pair_ind=pair_ind+1;
elseif pair_ind==(2^bn)+1
temp2(nonpair_ind,2)=temp(nonpair_ind,3);
runningsum=runningsum+nonpairsum;
nonpair_ind=nonpair_ind+1;
elseif sum(abs(runningsum+pairsum))<=sum(abs(runningsum+nonpairsum))
temp2(pair_ind,1)=temp(pair_ind,2);
runningsum=runningsum+pairsum;
pair_ind=pair_ind+1;
elseif sum(abs(runningsum+pairsum))>sum(abs(runningsum+nonpairsum))
temp2(nonpair_ind,2)=temp(nonpair_ind,3);
runningsum=runningsum+nonpairsum;
nonpair_ind=nonpair_ind+1;
end
end
accumout=zeros(2^(bn+sn),1);
for k=1:2^bn
if any(temp2(k,:))
small_data=data(bigindex==temp(k,1),smallcol);
if ~isempty(small_data)
small_accum=mat2accum(small_data);
[small_accumpairs, small_accum]=getpairs(small_accum);
pairs=accum2mat(small_accumpairs);
non_pairs=accum2mat(small_accum);
else
pairs=zeros(1,sn);
non_pairs=zeros(1,sn);
end
if temp2(k,1)
datatemp=zeros(temp2(k,1),sn+bn);
datatemp(:,bigcol)=((double(dec2bin(ones(temp2(k,1),1)*(temp(k,1)-1),bn))-48)*2)-1;
datatemp(:,smallcol)=pairs;
accumout=accumout+mat2accum(datatemp);
end
if temp2(k,2)
datatemp=zeros(temp2(k,2),sn+bn);
datatemp(:,bigcol)=((double(dec2bin(ones(temp2(k,2),1)*(temp(k,1)-1),bn))-48)*2)-1;
datatemp(:,smallcol)=non_pairs;
accumout=accumout+mat2accum(datatemp);
end
end
end
accumleft=accum-accumout;
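A hypothetical call for the mixed case (the column split below is purely illustrative):
accum = mat2accum(data);
bigcol = 1:5;                         % columns assumed to have large absolute sums
smallcol = 6:10;                      % columns assumed to have sums close to zero
[accumout, accumleft] = getseparated(accum, bigcol, smallcol, 100000);
subset = accum2mat(accumout);         % the selected ~100k rows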
With data composed of 5 columns of the first case and 5 columns of the second case, it was possible to construct a set of 100k rows with <1k of sum in the small-sum columns and between 10k and 30k in the big ones.
It's worth noting that the size of the data, the size of the required subset, and the distribution of 1's and -1's will have a great effect on the performance of these algorithms.
This problem, sadly, falls outside the scope of regular (continuous) optimization. Your problem, which can be parameterized as follows:
min_{S∈S_n} Σ_{i=1..10} |Σ_{j∈S} data_{j,i}|
where S_n is the set of n-element combinations of row indices j∈{1,...,250000}, can also be rewritten as a very similar regular Quadratic Integer Programming problem in x:
min_x x'*data*data'*x
s.t. x∈{0,1}^250000 and 1'*x=n
where data is your 250000x10 matrix and x is the 250000x1 selection vector we are looking for. (Now we optimize the sum of squares instead of the sum of absolute values...)
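As a quick sanity check of this reformulation (my own illustration; selected_rows is a hypothetical index set of n chosen rows), the quadratic objective is just the squared 2-norm of the subset's column sums:
x = zeros(size(data, 1), 1);
x(selected_rows) = 1;            % 0/1 selection vector
colsums = data' * x;             % 10x1 vector of column sums over the selected rows
objective = colsums' * colsums;  % equals x' * (data * data') * x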
This problem is proven to be NP-hard, which means that to find the global minimizer you essentially have to go through all possible combinations of n draws out of 250000 possibilities, i.e. the binomial coefficient C(250000, n) = 250000!/(n!*(250000-n)!)
...
So good luck... ;)
EDIT
If you're going to solve this heuristically, since I suppose you're going to need a solution, use the heuristics here instead of your approach.