I'm looking for suggestions on how to solve the following problem elegantly. Although performance isn't an issue in my specific case, I'd appreciate comments regarding good practices.
Thanks in advance!
I'm trying to average matrix rows according to some logic, while ignoring NaN values. The code I currently have does not handle NaN values the way I want.
My data is built in the following manner:
Here's an example:
DATA = [...
180 NaN NaN 1.733
180 NaN NaN 1.703
200 0.720 2.117 1.738
200 0.706 2.073 1.722
200 0.693 2.025 1.723
200 NaN NaN 1.729
210 NaN NaN 1.820
210 NaN NaN 1.813
210 NaN NaN 1.805
240 NaN NaN 1.951
240 NaN NaN 1.946
240 NaN NaN 1.946
270 NaN NaN 2.061
270 NaN NaN 2.052
300 0.754 2.356 2.103
300 0.758 2.342 2.057
300 NaN NaN 2.066
300 NaN NaN 2.066 ];
The desired result is a matrix that contains the unique "bins" in the first column, and means "unspoiled by NaNs" in the rest, e.g.:
(0.720+0.706+0.693)/3=0.7063
-- note the division by 3 (and not 4) for this column+bin.Here's the desired result for the above example:
RES = [...
180 NaN NaN 1.718
200 0.7063 2.072 1.728
210 NaN NaN 1.812
240 NaN NaN 1.948
270 NaN NaN 2.056
300 0.756 2.349 2.074 ];
This is some code I managed to compile from several sources. It is working well for column+bin that contain NaNs or numbers only.
nDataCols=size(DATA,2)-1;
[u,m,n] = unique(DATA(:,1));
sz = size(m);
N=accumarray(n,1,sz);
RES(length(u),nDataCols) = 0; %Preallocation
for ind1 = 1:nDataCols
RES(:,ind1)=accumarray(n,DATA(:,ind1+1),sz)./N;
end
RES= [u,RES];
Here's what I'm currently getting:
RES = [...
180 NaN NaN 1.718
200 NaN NaN 1.728
210 NaN NaN 1.812
240 NaN NaN 1.948
270 NaN NaN 2.056
300 NaN NaN 2.074 ];
One possible approach: find changes in first column (exploiting the fact that it's pre-sorted) and apply nanmean
to each block of rows:
ind = find(diff([-inf; (DATA(:,1)); inf])~=0); %// value changed: start of block
r = arrayfun(@(n) nanmean(DATA(ind(n):ind(n+1)-1,:)), 1:numel(ind)-1, 'uni', 0);
RES = vertcat(r{:});
You can replace arrayfun
by an explicit loop. That may be faster, and avoids the overhead introduced by cells:
ind = find(diff([-inf; (DATA(:,1)); inf])~=0); %// value changed: start of block
RES = zeros(numel(ind)-1, size(DATA,2)); %// preallocate
for n = 1:numel(ind)-1 %// loop over blocks
RES(n,:) = nanmean(DATA(ind(n):ind(n+1)-1,:));
end
Your approach can be used as well. You only need to call accumarray
with a handle to the nanmean
function. This doesn't require the first column to be pre-sorted.
nDataCols = size(DATA,2)-1;
[u, ~, n] = unique(DATA(:,1));
RES = zeros(length(u), nDataCols); %// Preallocation
for ind1 = 1:nDataCols
RES(:,ind1) = accumarray(n, DATA(:,ind1+1), [], @nanmean);
end
RES = [u, RES];
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With