Given the following example:
df =
0 NaN 5.0 NaN 6.0 NaN
1 5.0 6.0 6.0 NaN NaN
2 6.0 6.0 NaN NaN NaN
3 6.0 NaN NaN NaN 6.0
4 NaN NaN NaN 6.0 NaN
5 6.0 6.0 6.0 8.0 7.0
6 6.0 6.0 8.0 7.0 8.0
7 6.0 8.0 7.0 8.0 8.0
8 8.0 7.0 8.0 8.0 NaN
9 7.0 8.0 8.0 NaN 9.0
how to find rows with consecutive k-NaN? For example, for k=3
, the desired rows are [2,3,4]
diff() function. This function calculates the difference between two consecutive DataFrame elements. Parameters: periods: Represents periods to shift for computing difference, Integer type value.
In [164]: df[df.astype(str).sum(1).str.contains(''.join(['nan']*3))]
Out[164]:
0 1 2 3 4 5
2 2 6.0 6.0 NaN NaN NaN
3 3 6.0 NaN NaN NaN 6.0
4 4 NaN NaN NaN 6.0 NaN
Explanation:
In [166]: df.astype(str).sum(1)
Out[166]:
0 0nan5.06.06.0nan
1 15.06.06.0nannan
2 26.06.0nannannan
3 36.0nannannan6.0
4 4nannannan6.0nan
5 56.06.06.08.07.0
6 66.06.08.07.08.0
7 76.08.07.08.08.0
8 88.07.08.08.0nan
9 97.08.08.0nan9.0
dtype: object
In [167]: ''.join(['nan']*3)
Out[167]: 'nannannan'
You can use a rolling window counting nan
s:
>>> import numpy as np
>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)
To get the corresponding rows just use iloc
:
>>> df.iloc[rows_with_k_consecutive_nans(df, )]
0 1 2 3 4 5
2 2 6.0 6.0 NaN NaN NaN
3 3 6.0 NaN NaN NaN 6.0
4 4 NaN NaN NaN 6.0 NaN
This can also be wrapped in a function:
def rows_with_k_consecutive_nans(df, k):
"""This is exactly like the above but using pandas functions instead of
numpys. (see also Scott Boston answer). The approach is completly identical!
"""
return df.isnull().rolling(window=k, axis=1).sum().ge(k).any(axis=1)
>>> df[rows_with_k_consecutive_nans(df, 3)] # no iloc here!
0 1 2 3 4 5
2 2 6.0 6.0 NaN NaN NaN
3 3 6.0 NaN NaN NaN 6.0
4 4 NaN NaN NaN 6.0 NaN
>>> df[rows_with_k_consecutive_nans(df, 2)] # with 2 consecutives
0 1 2 3 4 5
1 1 5.0 6.0 6.0 NaN NaN
2 2 6.0 6.0 NaN NaN NaN
3 3 6.0 NaN NaN NaN 6.0
4 4 NaN NaN NaN 6.0 NaN
I'll only explain the numpy approach, the pandas functions are almost identical to these.
np.isnan
to find the NaN
s>>> np.isnan(df)
0 1 2 3 4 5
0 False True False True False True
1 False False False False True True
2 False False False True True True
3 False False True True True False
4 False True True True False True
5 False False False False False False
6 False False False False False False
7 False False False False False False
8 False False False False False True
9 False False False False True False
pd.DataFrame.rolling
to get the consecutive NaNs>>> np.isnan(df).rolling(window=3, axis=1).sum()
0 1 2 3 4 5
0 NaN NaN 1.0 2.0 1.0 2.0
1 NaN NaN 0.0 0.0 1.0 2.0
2 NaN NaN 0.0 1.0 2.0 3.0
3 NaN NaN 1.0 2.0 3.0 2.0
4 NaN NaN 2.0 3.0 2.0 2.0
5 NaN NaN 0.0 0.0 0.0 0.0
6 NaN NaN 0.0 0.0 0.0 0.0
7 NaN NaN 0.0 0.0 0.0 0.0
8 NaN NaN 0.0 0.0 0.0 1.0
9 NaN NaN 0.0 0.0 1.0 1.0
>>> np.isnan(df).rolling(window=3, axis=1).sum() >= 3
0 1 2 3 4 5
0 False False False False False False
1 False False False False False False
2 False False False False False True
3 False False False False True False
4 False False False True False False
5 False False False False False False
6 False False False False False False
7 False False False False False False
8 False False False False False False
9 False False False False False False
>>> np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1) # rows with at least 1 True
array([False, False, True, True, True, False, False, False, False, False], dtype=bool)
np.flatnonzero
gives you the indices of True
s.>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)
MSeifert's rolling solution with pandas:
df[df.isnull().rolling(window=3,axis=1).sum().ge(3).any(axis=1)]
Output:
0 1 2 3 4 5
2 2 6.0 6.0 NaN NaN NaN
3 3 6.0 NaN NaN NaN 6.0
4 4 NaN NaN NaN 6.0 NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With