
finding clustered NaNs but leaving lonely NaNs alone

Tags:

nan

matlab

I have an incomplete dataset,

http://imgur.com/Tpu6Hcf

N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]'

I wish to identify clusters of NaNs, that is, runs in which the number of consecutive NaNs exceeds 2. How do I do that?

Easyquestionsonly asked Jan 22 '26 00:01


1 Answer

You could do something like this:

aux = diff([0; isnan(N); 0]);
clusters = [find(aux == 1) find(aux == -1) - 1];

Then clusters will be an M-by-2 matrix, where M is the number of NaN clusters (all of them, including single NaNs), and each row gives you the start and end index of one cluster.

In this example, that would be:

clusters =

     1     1
     5     5
     8     9
    15    19

It means you have 4 NaN clusters, and cluster one ranges from index 1 to index 1, cluster two ranges from 5 to 5, cluster three ranges from 8 to 9 and cluster four ranges from 15 to 19.

If you want only the clusters with at least K NaNs, you could do it like this (for example, with K = 2):

K = 2;
clusters(clusters(:,2) - clusters(:,1) + 1 >= K, :)

That would give you this:

ans =

     8     9
    15    19

That is, clusters 8-9 and 15-19 have 2 or more NaNs.
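If you also need to tell the clustered NaNs apart from the lonely ones element by element, the filtered rows can be turned into logical masks. This is only a sketch under the same assumptions as above (N is a column vector); the names big, inCluster and lonely are made up for illustration:

```matlab
% Sample data from the question (column vector).
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]';
K = 2;  % minimum run length that counts as a cluster

% Start/end indices of every NaN run, then keep only runs of length >= K.
aux = diff([0; isnan(N); 0]);
clusters = [find(aux == 1), find(aux == -1) - 1];
big = clusters(clusters(:,2) - clusters(:,1) + 1 >= K, :);

% Hypothetical follow-up: mark every element inside a kept cluster.
inCluster = false(size(N));
for r = 1:size(big, 1)
    inCluster(big(r,1):big(r,2)) = true;
end

% NaNs that are not part of any kept cluster (the "lonely" ones).
lonely = isnan(N) & ~inCluster;
```

With the sample data, inCluster is true at indices 8, 9 and 15 to 19, and lonely is true at indices 1 and 5.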

Explanation:

  • Finding the clusters

isnan(N) gives you a logical vector containing the NaNs as ones:

N --------> NaN 1  2  3 NaN 5  6 NaN NaN 7  8 10 12 20 NaN NaN NaN NaN NaN
isnan(N) ->  1  0  0  0  1  0  0  1   1  0  0  0  0  0  1   1   1   1   1

We want to know where each sequence of ones starts, so we use diff, which calculates each value minus the previous one, and gives us this:

aux = diff(isnan(N));
N ----> NaN 1  2  3 NaN 5  6 NaN NaN 7  8 10 12 20 NaN NaN NaN NaN NaN
aux --> -1  0  0  1 -1  0  1  0  -1  0  0  0  0  1   0   0   0   0

Here a 1 indicates a group start and a -1 indicates a group end. Note that without padding the result is shifted by one: a 1 at position i means the group starts at N(i + 1). It also misses the first group start and the last group end: the first 1 is absent because the first element of N has no previous element to be compared with, and the last -1 is absent because there is nothing after the last element of N. A common fix is to add a zero before and after the array, which gives us this:

aux = diff([0; isnan(N); 0]);
N ----> NaN 1  2  3 NaN 5  6 NaN NaN 7  8 10 12 20 NaN NaN NaN NaN NaN
aux -->  1 -1  0  0  1 -1  0  1  0  -1  0  0  0  0  1   0   0   0   0  -1

Notice two things:

  1. If the diff at index i is 1, N(i) is the start of the NaN block.
  2. If the diff at index i is -1, N(i - 1) is the end of the NaN block.
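These two observations can be checked directly on the sample data. The snippet below is just a sanity check, not part of the solution; starts and stops are illustrative variable names:

```matlab
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]';
aux = diff([0; isnan(N); 0]);   % one element longer than N

starts = find(aux == 1);        % aux(i) ==  1  ->  N(i) starts a NaN block
stops  = find(aux == -1) - 1;   % aux(i) == -1  ->  N(i - 1) ends a NaN block

% Every start and every end must land on a NaN, and each start needs an end.
assert(all(isnan(N(starts))));
assert(all(isnan(N(stops))));
assert(numel(starts) == numel(stops));
```

Because of the zero padding, every 1 is eventually followed by a -1, so starts and stops always have the same length and can be placed side by side in one matrix.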

To get the start and end, we use find to get the indexes where aux == 1 and aux == -1. Hence, we call find twice, and concatenate both calls using [ and ]:

aux = diff([0; isnan(N); 0]);
clusters = [find(aux == 1) find(aux == -1) - 1];

  • Filtering the clusters which have K or more elements

The last step is to find the clusters which have K or more elements. To do that, we first take the cluster matrix, subtract the first column from the second, and add 1, like this:

clusters(:,2) - clusters(:,1) + 1
ans = 
     1
     1
     2
     5

It means clusters 1 and 2 have 1 NaN each, cluster 3 has 2 NaNs and cluster 4 has 5 NaNs. If we ask which values are greater than or equal to K, we get this:

clusters(:,2) - clusters(:,1) + 1 >= K
ans =
     0
     0
     1
     1

It's a logical array. We can use that to index only the 1 (true) rows of the cluster matrix, like this:

clusters(clusters(:,2) - clusters(:,1) + 1 >= K, :)
ans =

     8     9
    15    19

It's like asking: give us only the clusters where the rows match the ones on this logical vector, and give us all columns (denoted by the :).
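Two small variations may be worth keeping in mind. The question asks for runs that "exceed 2" NaNs; if that is read strictly, K = 3 would be the threshold. Also, the zero padding with ; assumes N is a column vector, so N(:) can be used to accept a row vector as well. The following is a sketch under those two assumptions (mask and runLength are illustrative names):

```matlab
% Same data as in the question, but stored as a row vector this time.
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN];
K = 3;   % "more than 2" consecutive NaNs, read as a strict threshold

mask = isnan(N(:));                      % N(:) forces a column vector
aux = diff([0; mask; 0]);                % pad so edge runs are detected
clusters = [find(aux == 1), find(aux == -1) - 1];
runLength = clusters(:,2) - clusters(:,1) + 1;
clusters = clusters(runLength >= K, :);  % keep only runs of length >= K
```

With K = 3 only the last run survives, so clusters contains the single row 15 19.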

Rafael Monteiro answered Jan 26 '26 23:01


