I have an incomplete dataset,

N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]'
I wish to identify clusters of NaNs, that is, runs where the number of consecutive NaNs exceeds 2. How do I do that?
You could do something like this:
aux = diff([0; isnan(N); 0]);
clusters = [find(aux == 1) find(aux == -1) - 1];
Then clusters will be an M-by-2 matrix, where M is the number of NaN clusters (all of them, whatever their length), and each row gives you the start and end index of one cluster.
In this example, that would be:
clusters =
1 1
5 5
8 9
15 19
It means you have 4 NaN clusters, and cluster one ranges from index 1 to index 1, cluster two ranges from 5 to 5, cluster three ranges from 8 to 9 and cluster four ranges from 15 to 19.
If you want only the clusters with at least K NaNs, you could do it like this (for example, with K = 2):
K = 2;
clusters(clusters(:,2) - clusters(:,1) + 1 >= K, :)
That would give you this:
ans =
8 9
15 19
That is, clusters 8-9 and 15-19 have 2 or more NaNs.
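The question asks for runs whose length exceeds 2, which is a strict comparison rather than >= K. Here is a minimal sketch of that variant, reusing the same clusters construction; the names runLength and longClusters are only illustrative and not part of the original answer:

```matlab
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]';

% Start and end index of every NaN run, built as in the answer above.
aux = diff([0; isnan(N); 0]);
clusters = [find(aux == 1) find(aux == -1) - 1];

% Length of each run, then keep only the runs strictly longer than 2.
runLength = clusters(:,2) - clusters(:,1) + 1;
longClusters = clusters(runLength > 2, :);
```

For this N, only the run from index 15 to index 19 remains in longClusters.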
Explanation:
isnan(N) gives you a logical vector containing the NaNs as ones:
N --------> NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN
isnan(N) -> 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 1 1 1 1
We want to know where each sequence of ones starts, so we use diff, which calculates each value minus the previous one, and gives us this:
aux = diff(isnan(N));
N ----> NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN
aux --> -1 0 0 1 -1 0 1 0 -1 0 0 0 0 1 0 0 0 0
Here a 1 indicates a group start and a -1 indicates a group end. But this misses the first group start and the last group end: the first 1 is absent because the first element of N has no previous element, and the last -1 is absent because nothing comes after the last NaN in N. A common fix is to add a zero before and after the array, which gives us this:
aux = diff([0; isnan(N); 0]);
N ----> NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN
aux --> 1 -1 0 0 1 -1 0 1 0 -1 0 0 0 0 1 0 0 0 0 -1
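To see the effect of the padding in numbers, the sketch below counts the start and end markers with and without the extra zeros. The variable names (raw, padded, nStartsRaw and so on) are made up for this illustration:

```matlab
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]';

% Without padding: the run at the very start has no 0 before it,
% and the run at the very end has no 0 after it.
raw = diff(isnan(N));
nStartsRaw = sum(raw == 1);    % 3 starts found, although 4 runs exist
nEndsRaw   = sum(raw == -1);   % 3 ends found, although 4 runs exist

% With padding: every run produces exactly one 1 and one -1.
padded = diff([0; isnan(N); 0]);
nStarts = sum(padded == 1);    % 4
nEnds   = sum(padded == -1);   % 4
```

With the padding, the starts and ends pair up one to one, which is what lets the two find results be placed side by side as columns.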
Notice two things:
If aux(i) is 1, N(i) is the start of a NaN block.
If aux(i) is -1, N(i - 1) is the end of a NaN block.
To get the start and end, we use find to get the indexes where aux == 1 and aux == -1. Hence, we call find twice, and concatenate both results using [ and ]:
aux = diff([0; isnan(N); 0]);
clusters = [find(aux == 1) find(aux == -1) - 1];
The last step is to find the clusters which have K or more elements. To do that, we take the cluster matrix, subtract the first column from the second, and add 1, like this:
clusters(:,2) - clusters(:,1) + 1
ans =
1
1
2
5
It means clusters 1 and 2 have 1 NaN each, cluster 3 has 2 NaNs and cluster 4 has 5 NaNs. If we ask which values are greater than or equal to K, we get this:
clusters(:,2) - clusters(:,1) + 1 >= K
ans =
0
0
1
1
It's a logical array. We can use that to index only the 1 (true) rows of the cluster matrix, like this:
clusters(clusters(:,2) - clusters(:,1) + 1 >= K, :)
ans =
8 9
15 19
It's like asking: give us only the clusters where the rows match the ones on this logical vector, and give us all columns (denoted by the :).
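Putting the steps together, a complete script could look like the sketch below. It makes one assumption beyond the answer: the input may be a row vector, so it uses N(:) to force a column before padding (the original [0; isnan(N); 0] only works for column input). The names starts, stops, runLength and keep are illustrative:

```matlab
% Row vector on purpose: N(:) turns row or column input into a column.
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN];
K = 2;

% Mark run boundaries: 1 at each run start, -1 just past each run end.
aux = diff([0; isnan(N(:)); 0]);
starts = find(aux == 1);
stops  = find(aux == -1) - 1;

% Run lengths, then a logical index selecting runs with at least K NaNs.
runLength = stops - starts + 1;
keep = runLength >= K;
clusters = [starts(keep) stops(keep)];
```

Keeping starts, stops and runLength as separate variables makes it easy to change the criterion later, for example to runLength > 2.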