Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mean computation from accumulated row values while ignoring NaN in MATLAB

I'm looking for suggestions on how to solve the following problem elegantly. Although performance isn't an issue in my specific case, I'd appreciate comments regarding good practices.

Thanks in advance!

The short version:

I'm trying to average matrix rows according to some logic, while ignoring NaN values. The code I currently have does not handle NaN values the way I want.

The long version:

My data is built in the following manner:

  • A single (first) column of "bins". The amount of rows for every bin is not constant. The bins don't have to be integers. Rows are pre-sorted.
  • A variable number of data columns, possibly including NaNs.

Here's an example:

DATA = [...
180     NaN     NaN     1.733
180     NaN     NaN     1.703
200     0.720   2.117   1.738
200     0.706   2.073   1.722
200     0.693   2.025   1.723
200     NaN     NaN     1.729
210     NaN     NaN     1.820
210     NaN     NaN     1.813
210     NaN     NaN     1.805
240     NaN     NaN     1.951
240     NaN     NaN     1.946
240     NaN     NaN     1.946
270     NaN     NaN     2.061
270     NaN     NaN     2.052
300     0.754   2.356   2.103
300     0.758   2.342   2.057
300     NaN     NaN     2.066
300     NaN     NaN     2.066 ];

The desired result is a matrix that contains the unique "bins" in the first column, and means "unspoiled by NaNs" in the rest, e.g.:

  • If for a specific column+bin, there are only NaNs (in the above example: 1st data column+bin 210) - the result would be NaN.
  • If for a specific column+bin there is a mix of NaNs and numbers, the result would be the mean of the valid numbers. In the above example: 1st data column+bin 200 should give (0.720+0.706+0.693)/3=0.7063 -- note the division by 3 (and not 4) for this column+bin.

Here's the desired result for the above example:

RES = [...
180     NaN     NaN     1.718
200     0.7063  2.072   1.728
210     NaN     NaN     1.812
240     NaN     NaN     1.948
270     NaN     NaN     2.056
300     0.756   2.349   2.074 ];

What I tried so far:

This is some code I managed to compile from several sources. It is working well for column+bin that contain NaNs or numbers only.

nDataCols=size(DATA,2)-1;
[u,m,n] = unique(DATA(:,1));
sz = size(m);
N=accumarray(n,1,sz);

RES(length(u),nDataCols) = 0; %Preallocation

for ind1 = 1:nDataCols
    RES(:,ind1)=accumarray(n,DATA(:,ind1+1),sz)./N;
end

RES= [u,RES];

Here's what I'm currently getting:

RES = [...
180     NaN     NaN     1.718
200     NaN     NaN     1.728
210     NaN     NaN     1.812
240     NaN     NaN     1.948
270     NaN     NaN     2.056
300     NaN     NaN     2.074 ];

p.s.

  1. If by any chance this is easier to do using a spreadsheet software (such as MS Excel) - I'd love to hear ideas.
  2. Doing the computation on a per-column basis is my current idea on how to handle this. I was just wondering if there's a way to generalize it to take the complete matrix right away.
like image 501
Dev-iL Avatar asked Jul 13 '14 15:07

Dev-iL


1 Answers

One possible approach: find changes in first column (exploiting the fact that it's pre-sorted) and apply nanmean to each block of rows:

ind = find(diff([-inf; (DATA(:,1)); inf])~=0); %// value changed: start of block
r = arrayfun(@(n) nanmean(DATA(ind(n):ind(n+1)-1,:)), 1:numel(ind)-1, 'uni', 0);
RES = vertcat(r{:});

You can replace arrayfun by an explicit loop. That may be faster, and avoids the overhead introduced by cells:

ind = find(diff([-inf; (DATA(:,1)); inf])~=0); %// value changed: start of block
RES = zeros(numel(ind)-1, size(DATA,2)); %// preallocate
for n = 1:numel(ind)-1 %// loop over blocks
    RES(n,:) = nanmean(DATA(ind(n):ind(n+1)-1,:));
end

Your approach can be used as well. You only need to call accumarray with a handle to the nanmean function. This doesn't require the first column to be pre-sorted.

nDataCols = size(DATA,2)-1;
[u, ~, n] = unique(DATA(:,1));
RES = zeros(length(u), nDataCols); %// Preallocation
for ind1 = 1:nDataCols
    RES(:,ind1) = accumarray(n, DATA(:,ind1+1), [], @nanmean);
end
RES = [u, RES];
like image 76
Luis Mendo Avatar answered Oct 12 '22 06:10

Luis Mendo