I would like to find those indexes of consecutive NaN in a Pandas DataFrame with more than 3 consecutive NaN returning their size. That is:
58234 NaN
58235 NaN
58236 0.424323
58237 0.424323
58238 NaN
58239 NaN
58240 NaN
58241 NaN
58242 NaN
58245 NaN
58246 1.483380
58247 1.483380
Should return something like (58238, 6). The actual format of the return doesn't matter too much. I have found the following.
df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
But it is not returning the right values per index. This question might be very similar to Identifying consecutive NaN's with pandas but any help would very appreciated as I am a total noob in Pandas.
Yes. pd. quantile() will ignore NaN values when calculating the quantile.
diff() function. This function calculates the difference between two consecutive DataFrame elements.
I broke down the steps:
df['Group']=df.a.notnull().astype(int).cumsum()
df=df[df.a.isnull()]
df=df[df.Group.isin(df.Group.value_counts()[df.Group.value_counts()>3].index)]
df['count']=df.groupby('Group')['Group'].transform('size')
df.drop_duplicates(['Group'],keep='first')
Out[734]:
a Group count
ID
58238 NaN 2 6
Assuming df
to have those as two columns named : A
, B
, here's one vectorized approach -
thresh = 3
a = df.A.values
b = df.B.values
idx0 = np.flatnonzero(np.r_[True, np.diff(np.isnan(b))!=0,True])
count = np.diff(idx0)
idx = idx0[:-1]
valid_mask = (count>=thresh) & np.isnan(b[idx])
out_idx = idx[valid_mask]
out_num = a[out_idx]
out_count = count[valid_mask]
out = zip(out_num, out_count)
Sample input, output -
In [285]: df
Out[285]:
A B
0 58234 NaN
1 58235 NaN
2 58236 0.424323
3 58237 0.424323
4 58238 NaN
5 58239 NaN
6 58240 NaN
7 58241 NaN
8 58242 NaN
9 58245 NaN
10 58246 1.483380
11 58247 1.483380
In [286]: out
Out[286]: [(58238, 6)]
With thresh = 2
, we have -
In [288]: out
Out[288]: [(58234, 2), (58238, 6)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With