Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

find rows with k-consecutive NaNs in Pandas

Given the following example:

 df = 
  0   NaN   5.0   NaN   6.0   NaN      
  1   5.0   6.0   6.0   NaN   NaN      
  2   6.0   6.0   NaN   NaN   NaN      
  3   6.0   NaN   NaN   NaN   6.0      
  4   NaN   NaN   NaN   6.0   NaN      
  5   6.0   6.0   6.0   8.0   7.0    
  6   6.0   6.0   8.0   7.0   8.0    
  7   6.0   8.0   7.0   8.0   8.0     
  8   8.0   7.0   8.0   8.0   NaN     
  9   7.0   8.0   8.0   NaN   9.0     

how to find rows with consecutive k-NaN? For example, for k=3, the desired rows are [2,3,4]

like image 229
Arnold Klein Avatar asked May 11 '17 12:05

Arnold Klein


People also ask

How do you tell the difference between consecutive rows in pandas?

diff() function. This function calculates the difference between two consecutive DataFrame elements. Parameters: periods: Represents periods to shift for computing difference, Integer type value.


3 Answers

In [164]: df[df.astype(str).sum(1).str.contains(''.join(['nan']*3))]
Out[164]:
   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

Explanation:

In [166]: df.astype(str).sum(1)
Out[166]:
0    0nan5.06.06.0nan
1    15.06.06.0nannan
2    26.06.0nannannan
3    36.0nannannan6.0
4    4nannannan6.0nan
5    56.06.06.08.07.0
6    66.06.08.07.08.0
7    76.08.07.08.08.0
8    88.07.08.08.0nan
9    97.08.08.0nan9.0
dtype: object

In [167]: ''.join(['nan']*3)
Out[167]: 'nannannan'
like image 169
MaxU - stop WAR against UA Avatar answered Oct 05 '22 13:10

MaxU - stop WAR against UA


You can use a rolling window counting nans:

>>> import numpy as np
>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)

To get the corresponding rows just use iloc:

>>> df.iloc[rows_with_k_consecutive_nans(df, )]
   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

This can also be wrapped in a function:

def rows_with_k_consecutive_nans(df, k):
    """This is exactly like the above but using pandas functions instead of
    numpys. (see also Scott Boston answer). The approach is completly identical!
    """
    return df.isnull().rolling(window=k, axis=1).sum().ge(k).any(axis=1)

>>> df[rows_with_k_consecutive_nans(df, 3)]  # no iloc here!
   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

>>> df[rows_with_k_consecutive_nans(df, 2)]  # with 2 consecutives
   0    1    2    3    4    5
1  1  5.0  6.0  6.0  NaN  NaN
2  2  6.0  6.0  NaN  NaN  NaN
3  3  6.0  NaN  NaN  NaN  6.0
4  4  NaN  NaN  NaN  6.0  NaN

Step-by-step:

I'll only explain the numpy approach, the pandas functions are almost identical to these.

np.isnan to find the NaNs

>>> np.isnan(df)
       0      1      2      3      4      5
0  False   True  False   True  False   True
1  False  False  False  False   True   True
2  False  False  False   True   True   True
3  False  False   True   True   True  False
4  False   True   True   True  False   True
5  False  False  False  False  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False  False
8  False  False  False  False  False   True
9  False  False  False  False   True  False

pd.DataFrame.rolling to get the consecutive NaNs

>>> np.isnan(df).rolling(window=3, axis=1).sum()
    0   1    2    3    4    5
0 NaN NaN  1.0  2.0  1.0  2.0
1 NaN NaN  0.0  0.0  1.0  2.0
2 NaN NaN  0.0  1.0  2.0  3.0
3 NaN NaN  1.0  2.0  3.0  2.0
4 NaN NaN  2.0  3.0  2.0  2.0
5 NaN NaN  0.0  0.0  0.0  0.0
6 NaN NaN  0.0  0.0  0.0  0.0
7 NaN NaN  0.0  0.0  0.0  0.0
8 NaN NaN  0.0  0.0  0.0  1.0
9 NaN NaN  0.0  0.0  1.0  1.0

Check for 3 consective NaNs

>>> np.isnan(df).rolling(window=3, axis=1).sum() >= 3
       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False  False  False  False  False  False
2  False  False  False  False  False   True
3  False  False  False  False   True  False
4  False  False  False   True  False  False
5  False  False  False  False  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False  False
8  False  False  False  False  False  False
9  False  False  False  False  False  False

>>> np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1)  # rows with at least 1 True
array([False, False,  True,  True,  True, False, False, False, False, False], dtype=bool)

np.flatnonzero gives you the indices of Trues.

>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)
like image 25
MSeifert Avatar answered Oct 05 '22 11:10

MSeifert


MSeifert's rolling solution with pandas:

 df[df.isnull().rolling(window=3,axis=1).sum().ge(3).any(axis=1)]

Output:

   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN
like image 35
Scott Boston Avatar answered Oct 05 '22 12:10

Scott Boston