
efficiently split data into bins

I want to split my data variable into separate variables a, b and c, and take the mean over each bin (along the 1st dimension). Is there a way to substantially improve the speed of this code (e.g. by an order of magnitude)? General feedback is welcome.

data=rand(20,1000); %generate data
bins=[5 10 5]; %given size of bins
start_bins=cumsum([1 bins(1:end-1)]);
end_bins=cumsum([bins]);
%split the data into 3 cell arrays and apply mean in 1st dimension
binned_data=cellfun(@(x,y) mean(data(x:y,:),1),num2cell(start_bins),num2cell(end_bins),'uni',0);
%the data (explicitly) has to be stored in different variables
[a,b,c]=deal(binned_data{:});
whos a b c
  Name      Size              Bytes  Class     Attributes

  a         1x1000             8000  double              
  b         1x1000             8000  double              
  c         1x1000             8000  double              
user2305193 asked Jan 16 '19 13:01


3 Answers

You can use splitapply (accumarray's slightly friendlier little brother):

% Your example
data = rand(20,1000); % generate data
bins = [5 10 5];      % given size of bins

% Calculation
bins = repelem(1:numel(bins), bins).'; % Bin sizes to group labels
binned_data = splitapply( @mean, data, bins ); % splitapply for calculation

The rows of binned_data are your a, b and c.
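
To get the separate variables the question asks for, the rows can be unpacked afterwards. The following is a minimal sketch rather than part of the original answer: the names bin_sizes and labels are hypothetical (the answer above reuses bins for both), and @(x) mean(x, 1) is swapped in for @mean so that a bin containing a single row is still averaged along the first dimension.

```matlab
data = rand(20, 1000);       % example data, as in the question
bin_sizes = [5 10 5];        % hypothetical name for the bin sizes

% One group label per row of data: [1 1 1 1 1 2 2 ... 3 3].'
labels = repelem(1:numel(bin_sizes), bin_sizes).';

% One row of column means per group (3 x 1000)
binned_data = splitapply(@(x) mean(x, 1), data, labels);

% Unpack the rows into separate variables
a = binned_data(1, :);
b = binned_data(2, :);
c = binned_data(3, :);
```

Each of a, b and c is then a 1x1000 row vector, matching the whos output in the question.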

Wolfie answered Oct 23 '22 07:10


Original question: splitting and averaging along different dims

The mean can be applied before the splitting, which reduces the data to a vector, and then accumarray can be used:

binned_data = accumarray(repelem(1:numel(bins), bins).', mean(data,2), [], @(x){x.'});

Edited question: splitting and averaging along same dim

accumarray¹ does not work with matrix data. But you can use sparse, which automatically accumulates data values corresponding to the same indices:

ind_rows = repmat(repelem((1:numel(bins)).', bins), 1, size(data,2));
ind_cols = repmat(1:size(data,2), size(data,1), 1);
binned_data = sparse(ind_rows, ind_cols, data);
binned_data = bsxfun(@rdivide, binned_data, bins(:));
binned_data = num2cell(binned_data, 2).';

¹ But splitapply does. See @Wolfie's answer.
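
The accumulating behaviour of sparse that this answer relies on can be seen on a tiny example. This is only an illustrative sketch; the variable names and values are made up for the demonstration.

```matlab
rows = [1 1 2 2 2];     % target row index for each value
cols = [1 1 1 1 1];     % all values go into column 1
vals = [10 20 1 2 3];

% Entries that share the same (row, col) index are summed
S = sparse(rows, cols, vals);
totals = full(S);       % [30; 6]
```

Dividing these per-bin sums by the bin sizes, as the bsxfun line above does, turns them into per-bin means.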

Luis Mendo answered Oct 23 '22 07:10


You can use matrix multiplication:

r = 1:numel(bins);
result = (r.' == repelem(r,bins)) * data .* (1./bins(:));

If you want the output as cell:

result = num2cell(result,2);

For large matrices it is better to use a sparse matrix:

result = sparse(r.' == repelem(r,bins)) * data .* (1./bins(:));

Note: In MATLAB versions before R2016b, which do not support implicit expansion, use bsxfun instead:

result = bsxfun(@times,bsxfun(@eq, r.',repelem(r,bins)) * data , (1./bins(:)));
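
To see why the multiplication works, it may help to look at the intermediate indicator matrix on a small input. The sketch below uses made-up data and hypothetical names (labels, M, sums), and assumes implicit expansion is available (MATLAB R2016b or later, or Octave).

```matlab
bins = [2 3 1];                                 % bin sizes
data = [1 2; 3 4; 10 20; 20 30; 30 40; 7 8];    % 6 rows, 2 columns
r = 1:numel(bins);

labels = repelem(r, bins);      % [1 1 2 2 2 3], bin label of each row
M = double(r.' == labels);      % 3x6 indicator: M(i,j) = 1 if row j is in bin i
% M = [1 1 0 0 0 0
%      0 0 1 1 1 0
%      0 0 0 0 0 1]

sums = M * data;                % per-bin column sums: [4 6; 60 90; 7 8]
result = sums ./ bins(:);       % per-bin column means: [2 3; 20 30; 7 8]
```

Each row of M selects the rows of data that belong to one bin, so M * data sums them, and dividing by the bin sizes gives the means.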

Here are the timing results for the three proposed methods in Octave:

Matrix Multiplication: 0.00197697 seconds
Accumarray:            0.00465298 seconds
Cellfun:               0.00718904 seconds

EDIT: For a 200 x 100000 matrix:

Matrix Multiplication: 0.806947 seconds (sparse: 0.2331 seconds)
Accumarray:            0.0398011 seconds
Cellfun:               0.386079 seconds
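
The answer does not show how these timings were measured. One way to produce comparable numbers is a simple tic/toc loop, sketched below for two of the methods. The repeat count n_runs and the handle names are assumptions made for illustration, and the figures it prints will depend on the machine and on the MATLAB or Octave version.

```matlab
data = rand(20, 1000);
bins = [5 10 5];
r = 1:numel(bins);
n_runs = 100;                          % hypothetical number of repetitions

% Candidate 1: matrix multiplication with an indicator matrix
f_matmul = @() ((r.' == repelem(r, bins)) * data) ./ bins(:);

% Candidate 2: cellfun, as in the question
start_bins = cumsum([1 bins(1:end-1)]);
end_bins = cumsum(bins);
f_cellfun = @() cellfun(@(x, y) mean(data(x:y, :), 1), ...
    num2cell(start_bins), num2cell(end_bins), 'UniformOutput', false);

candidates = {f_matmul, f_cellfun};
names = {'Matrix multiplication', 'Cellfun'};
elapsed = zeros(1, numel(candidates));
for k = 1:numel(candidates)
    f = candidates{k};
    out = f();                         % warm-up call, not timed
    tic;
    for n = 1:n_runs
        out = f();
    end
    elapsed(k) = toc / n_runs;         % average seconds per call
    fprintf('%s: %g seconds\n', names{k}, elapsed(k));
end
```

Averaging over many runs and discarding a warm-up call reduces noise, which matters at the millisecond scale of the small example.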
rahnema1 answered Oct 23 '22 06:10