
Force `unique` to treat NaNs as indistinct

Tags: matlab

I have a long matrix of numbers that represent molecular states. A subset might look like this:

 states = [...
  1     1     1     1
  1     1     1     1
  1     0     1     1
NaN     0   NaN   NaN
  1     0     1     0
  1     0     1     1
NaN   NaN   NaN   NaN
  1     0     1     1
NaN   NaN   NaN   NaN
  1     1     0     0
 ];

where NaN marks states whose representation is unknown. In practice, this list might have hundreds of thousands of entries. If I use the `unique` command to get the unique states, the result looks like

K>>unique(states,'rows')

ans = 

     1     0     1     0
     1     0     1     1
     1     1     0     0
     1     1     1     1
   NaN     0   NaN   NaN
   NaN   NaN   NaN   NaN
   NaN   NaN   NaN   NaN

because "unique treats NaN values as distinct".
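For context, this follows from floating-point comparison semantics, where NaN never compares equal to anything, itself included. A minimal sketch of how `==`, `isequal`, and `isequaln` differ on two rows with the same NaN pattern (the variable names are illustrative only):

```matlab
a = [NaN 0 NaN NaN];
b = [NaN 0 NaN NaN];

elementwise = (a == b);        % [0 1 0 0]: NaN == NaN is false
plainEqual  = isequal(a, b);   % false, NaN breaks the comparison
nanEqual    = isequaln(a, b);  % true, isequaln treats NaNs as equal
```

`isequaln` compares whole arrays, so it answers "are these two rows the same?" but does not by itself deduplicate a matrix.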

How can I massage this output so that NaN values are not treated as distinct? That is, [NaN NaN NaN NaN] should still be distinct from [NaN 0 NaN NaN], but [NaN NaN NaN NaN] should equal [NaN NaN NaN NaN].

craigim asked Aug 26 '14 23:08




1 Answer

Code

%// Get unique rows with in-built "unique" that considers NaN as distinct
unq1 = unique(states,'rows');

%// Detect nans
unq1_nans = isnan(unq1);

%// Find nan equalities across rows
unq1_nans_roweq = bsxfun(@plus,unq1_nans,permute(unq1_nans,[3 2 1]))==2;

%// Find non-nan equalities across rows
unq1_nonans_roweq = bsxfun(@eq,unq1,permute(unq1,[3 2 1]));

%// Find "universal" (nan or non-nan) equalities across rows
unq1_univ_roweq = unq1_nans_roweq | unq1_nonans_roweq;

%// Drop every row that matches an earlier row, keeping only the first
%// occurrence, as with the default behavior of MATLAB's in-built unique
out = unq1(~any(triu(squeeze(sum(unq1_univ_roweq,2)==size(states,2)),1),1),:);
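To see what the `permute`/`bsxfun` steps build, here is a small sketch on a hypothetical 3-by-2 matrix `A` (not from the question). It assumes the same approach as the code above: compare every row against every other row along a third dimension, then collapse the columns.

```matlab
A = [1 NaN; 1 NaN; 2 0];            % rows 1 and 2 are the same "NaN-aware" row

nanMask = isnan(A);
% 3x2x3 arrays: slice (:,:,k) compares every row of A against row k
nanEq = bsxfun(@plus, nanMask, permute(nanMask, [3 2 1])) == 2;  % both NaN
valEq = bsxfun(@eq, A, permute(A, [3 2 1]));                     % both equal

% rowEq(i,k) is true when rows i and k agree in every column
rowEq = squeeze(sum(nanEq | valEq, 2) == size(A, 2));

% The strict upper triangle marks row k as a duplicate of an earlier row i < k
isDup = any(triu(rowEq, 1), 1);     % [0 1 0]
kept  = A(~isDup, :);               % [1 NaN; 2 0]
```

The intermediate arrays are N-by-C-by-N, where N is the number of rows left after `unique`, so memory use grows with the square of that count.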

Example #1

Input -

states =
    3.0000    1.0000    7.0000    8.0000
    8.0000         0    1.0000    6.0000
       Inf         0       NaN       NaN
    5.0000         0    1.0000         0
       Inf         0       NaN       NaN
    7.0000         0    5.0000    1.0000
       NaN       NaN   11.2000       Inf
       NaN       NaN   15.0000       NaN
       NaN       NaN   11.2000       Inf

Intermediate result using MATLAB's in-built unique + 'rows' -

unq1 =
    3.0000    1.0000    7.0000    8.0000
    5.0000         0    1.0000         0
    7.0000         0    5.0000    1.0000
    8.0000         0    1.0000    6.0000
       Inf         0       NaN       NaN
       Inf         0       NaN       NaN
       NaN       NaN   11.2000       Inf
       NaN       NaN   11.2000       Inf
       NaN       NaN   15.0000       NaN

Notice that two identical rows, [Inf 0 NaN NaN], still show up, and so does another identical pair, [NaN NaN 11.2000 Inf]. We need to keep one row from each of these two pairs. Let's see how our code performs -

out =
    3.0000    1.0000    7.0000    8.0000
    5.0000         0    1.0000         0
    7.0000         0    5.0000    1.0000
    8.0000         0    1.0000    6.0000
       Inf         0       NaN       NaN
       NaN       NaN   11.2000       Inf
       NaN       NaN   15.0000       NaN

It worked alright!

Example #2

For the final test, let's try an input array that also contains very large numbers, like this one -

states =
            3            1            7            8
            8            0            1            6
          Inf            0          NaN          NaN
            5            0            1            0
          Inf            0          NaN          NaN
            7            0            5            1
          NaN          NaN       1e+100          Inf
          NaN          NaN           15          NaN
          NaN          NaN       1e+100          Inf

The intermediate result with unique + 'rows' would be -

unq1 =
            3            1            7            8
            5            0            1            0
            7            0            5            1
            8            0            1            6
          Inf            0          NaN          NaN
          Inf            0          NaN          NaN
          NaN          NaN           15          NaN
          NaN          NaN       1e+100          Inf
          NaN          NaN       1e+100          Inf

So, our code must remove one of the two [Inf 0 NaN NaN] rows and one of the final two rows.

out =
            3            1            7            8
            5            0            1            0
            7            0            5            1
            8            0            1            6
          Inf            0          NaN          NaN
          NaN          NaN           15          NaN
          NaN          NaN       1e+100          Inf

It does!
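As a possible alternative, and not the method used above, the NaN positions can be folded into the comparison key so that a single `unique(...,'rows')` call does the work. This is only a sketch under the assumption that the order of the returned rows does not matter; `nanMask`, `filled`, and `key` are illustrative names. Because the key records NaN positions in a separate mask rather than substituting a placeholder value, large entries such as 1e+100 need no special handling.

```matlab
states = [  1   1   1   1
            1   1   1   1
            1   0   1   1
          NaN   0 NaN NaN
            1   0   1   0
            1   0   1   1
          NaN NaN NaN NaN
            1   0   1   1
          NaN NaN NaN NaN
            1   1   0   0];

nanMask = isnan(states);
filled  = states;
filled(nanMask) = 0;               % any fixed value works; the mask disambiguates

% Two rows are NaN-aware equal exactly when their keys are equal
key = [filled, nanMask];
[~, idx] = unique(key, 'rows');
out = states(idx, :);              % 6 rows, sorted by key rather than NaNs-last
```

Unlike the `bsxfun` approach, this avoids the N-by-C-by-N intermediates, at the cost of a result ordered by the key (NaN sorts as if it were 0) instead of with NaN rows last.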

Divakar answered Sep 22 '22 12:09