Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Consecutive NaN larger than threshold in Pandas DataFrame

I would like to find those indexes of consecutive NaN in a Pandas DataFrame with more than 3 consecutive NaN returning their size. That is:

58234         NaN
58235         NaN
58236    0.424323
58237    0.424323
58238         NaN
58239         NaN
58240         NaN
58241         NaN
58242         NaN
58245         NaN
58246    1.483380
58247    1.483380

Should return something like (58238, 6). The actual format of the return doesn't matter too much. I have found the following.

df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()

But it is not returning the right values per index. This question might be very similar to Identifying consecutive NaN's with pandas but any help would very appreciated as I am a total noob in Pandas.

like image 726
Noque Avatar asked Sep 01 '17 20:09

Noque


People also ask

Does pandas quantile ignore NaN?

Yes. pd. quantile() will ignore NaN values when calculating the quantile.

How do you tell the difference between consecutive rows in pandas?

diff() function. This function calculates the difference between two consecutive DataFrame elements.


2 Answers

I broke down the steps:

df['Group']=df.a.notnull().astype(int).cumsum()
df=df[df.a.isnull()]
df=df[df.Group.isin(df.Group.value_counts()[df.Group.value_counts()>3].index)]
df['count']=df.groupby('Group')['Group'].transform('size')
df.drop_duplicates(['Group'],keep='first')
Out[734]: 
        a  Group  count
ID                     
58238 NaN      2      6
like image 93
BENY Avatar answered Oct 19 '22 16:10

BENY


Assuming df to have those as two columns named : A, B, here's one vectorized approach -

thresh = 3

a = df.A.values
b = df.B.values

idx0 = np.flatnonzero(np.r_[True, np.diff(np.isnan(b))!=0,True])
count = np.diff(idx0)
idx = idx0[:-1]
valid_mask = (count>=thresh) & np.isnan(b[idx])
out_idx = idx[valid_mask]
out_num = a[out_idx]
out_count = count[valid_mask]
out = zip(out_num, out_count)

Sample input, output -

In [285]: df
Out[285]: 
        A         B
0   58234       NaN
1   58235       NaN
2   58236  0.424323
3   58237  0.424323
4   58238       NaN
5   58239       NaN
6   58240       NaN
7   58241       NaN
8   58242       NaN
9   58245       NaN
10  58246  1.483380
11  58247  1.483380

In [286]: out
Out[286]: [(58238, 6)]

With thresh = 2, we have -

In [288]: out
Out[288]: [(58234, 2), (58238, 6)]
like image 41
Divakar Avatar answered Oct 19 '22 15:10

Divakar