I have a long matrix of numbers that represent molecular states. A subset might look like this:
states = [...
1 1 1 1
1 1 1 1
1 0 1 1
NaN 0 NaN NaN
1 0 1 0
1 0 1 1
NaN NaN NaN NaN
1 0 1 1
NaN NaN NaN NaN
1 1 0 0
];
where the NaN values are for states where the representation is unknown. In practice, this list might have hundreds of thousands of values. If I use the unique command to get the unique states, the result looks like
K>>unique(states,'rows')
ans =
1 0 1 0
1 0 1 1
1 1 0 0
1 1 1 1
NaN 0 NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
because "unique treats NaN values as distinct".
How can I massage this output so that NaN values are not treated as distinct? That is, [NaN NaN NaN NaN] should still be distinct from [NaN 0 NaN NaN], but [NaN NaN NaN NaN] should count as equal to [NaN NaN NaN NaN].
In computing, NaN (/næn/), standing for Not a Number, is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable, especially in floating-point arithmetic.
MATLAB preserves the “Not a Number” status of alternate NaN representations and treats all of the different representations of NaN equivalently.
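To make the comparison behaviour concrete, here is a small sketch (the variable names are illustrative only). It shows why two rows that look identical are still reported as different: NaN never compares equal to itself under == or isequal, while isequaln, available in MATLAB R2012a and later, treats NaN values as equal.

```matlab
% Two rows that look identical but contain NaN
a = [NaN 0 NaN NaN];
b = [NaN 0 NaN NaN];

% Element-wise ==: NaN positions compare as false
eq_elementwise = (a == b);

% isequal follows the same rule, so the rows count as different
eq_isequal = isequal(a, b);

% isequaln treats NaN as equal to NaN, so the rows count as equal
eq_isequaln = isequaln(a, b);
```

The solution below applies the same idea row against row: two entries match if they are equal or if both are NaN.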
Code
%// Get unique rows with in-built "unique" that considers NaN as distinct
unq1 = unique(states,'rows');
%// Detect nans
unq1_nans = isnan(unq1);
%// Find nan equalities across rows
unq1_nans_roweq = bsxfun(@plus,unq1_nans,permute(unq1_nans,[3 2 1]))==2;
%// Find non-nan equalities across rows
unq1_nonans_roweq = bsxfun(@eq,unq1,permute(unq1,[3 2 1]));
%// Find "universal" (nan or non-nan) equalities across rows
unq1_univ_roweq = unq1_nans_roweq | unq1_nonans_roweq;
%// Remove non-unique rows except the first non-unique match as with
%// the default functionality of MATLAB's in-built unique function
out = unq1(~any(triu(squeeze(sum(unq1_univ_roweq,2)==size(states,2)),1),1),:);
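A lighter-weight variation, offered as a sketch rather than a replacement: the pairwise comparison above builds an N-by-columns-by-N array, which can get large when many unique rows remain. Assuming that unique(...,'rows') places rows that are identical apart from their NaN entries next to each other in its sorted output (the outputs shown in this thread are consistent with that, but it is an assumption here), it is enough to compare each row with the one directly before it. The names prev, next, same, dup and keep are introduced only for this illustration.

```matlab
% Input taken from Example #1 below
states = [
    3    1    7     8
    8    0    1     6
    Inf  0    NaN   NaN
    5    0    1     0
    Inf  0    NaN   NaN
    7    0    5     1
    NaN  NaN  11.2  Inf
    NaN  NaN  15    NaN
    NaN  NaN  11.2  Inf
];

% Sorted unique rows; rows containing NaN may still appear more than once
unq1 = unique(states, 'rows');

% Each row paired with its successor
prev = unq1(1:end-1, :);
next = unq1(2:end, :);

% Two entries match if they are equal or if both are NaN
same = (prev == next) | (isnan(prev) & isnan(next));

% A row is a duplicate if every entry matches the row before it
dup = all(same, 2);

% Always keep the first row, then drop rows that repeat their predecessor
keep = true(size(unq1, 1), 1);
keep(2:end) = ~dup;
out = unq1(keep, :);
```

This needs memory proportional to the size of unq1 only. If the adjacency assumption does not hold for some input, duplicates that are not neighbours would be missed, so the full pairwise comparison remains the safer choice.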
Example #1
Input -
states =
3.0000 1.0000 7.0000 8.0000
8.0000 0 1.0000 6.0000
Inf 0 NaN NaN
5.0000 0 1.0000 0
Inf 0 NaN NaN
7.0000 0 5.0000 1.0000
NaN NaN 11.2000 Inf
NaN NaN 15.0000 NaN
NaN NaN 11.2000 Inf
Intermediate result using MATLAB's in-built unique + 'rows' -
unq1 =
3.0000 1.0000 7.0000 8.0000
5.0000 0 1.0000 0
7.0000 0 5.0000 1.0000
8.0000 0 1.0000 6.0000
Inf 0 NaN NaN
Inf 0 NaN NaN
NaN NaN 11.2000 Inf
NaN NaN 11.2000 Inf
NaN NaN 15.0000 NaN
Notice that two rows with identical values - [Inf 0 NaN NaN] are still showing up, and similarly we have another identical pair - [NaN NaN 11.2000 Inf]. We need to keep one unique row for each of these two pairs. Let's see how our code performs -
out =
3.0000 1.0000 7.0000 8.0000
5.0000 0 1.0000 0
7.0000 0 5.0000 1.0000
8.0000 0 1.0000 6.0000
Inf 0 NaN NaN
NaN NaN 11.2000 Inf
NaN NaN 15.0000 NaN
It worked alright!
Example #2
For the final test, let's try an input array that also contains very large numbers, like this one -
states =
3 1 7 8
8 0 1 6
Inf 0 NaN NaN
5 0 1 0
Inf 0 NaN NaN
7 0 5 1
NaN NaN 1e+100 Inf
NaN NaN 15 NaN
NaN NaN 1e+100 Inf
The intermediate result with unique + 'rows' would be -
unq1 =
3 1 7 8
5 0 1 0
7 0 5 1
8 0 1 6
Inf 0 NaN NaN
Inf 0 NaN NaN
NaN NaN 15 NaN
NaN NaN 1e+100 Inf
NaN NaN 1e+100 Inf
So, our code must remove one row from each duplicate pair, including one of the final two rows.
out =
3 1 7 8
5 0 1 0
7 0 5 1
8 0 1 6
Inf 0 NaN NaN
NaN NaN 15 NaN
NaN NaN 1e+100 Inf
It does!