I am working with time series data. I am having trouble removing runs of consecutive NaNs from a DataFrame column when the run length is less than or equal to a threshold. I looked at some related questions, such as:
Identifying consecutive NaN's with pandas : Identifies where consecutive NaNs occur and how many there are.
Pandas: run length of NaN holes : Outputs a run-length encoding of the NaNs.
There are many others along these lines, but none of them shows how to remove the NaNs once they have been identified.
I found one similar solution, but it is in R : How to remove more than 2 consecutive NA's in a column?
I want a solution in Python.
So here is the example:
Here is my dataframe column:
a
0 36.45
1 35.45
2 NaN
3 NaN
4 NaN
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
10 NaN
11 NaN
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
If k = 3, my output should be:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71
How can I remove runs of consecutive NaNs whose length is less than or equal to some threshold (k)?
There are a few ways, but this is how I've done it: use the cumsum trick to label each run of consecutive values, then groupby + transform to determine the size of each group.
k = 3
i = df.a.isnull()
m = ~(df.groupby(i.ne(i.shift()).cumsum().values).a.transform('size').le(k) & i)
df[m]
a
0 36.45
1 35.45
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
You can finish with df = df[m].reset_index(drop=True) if you want a monotonically increasing integer index.
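To make the one-liner above easier to follow, here is a minimal sketch that splits the mask into named steps. The variable names (is_nan, run_id, run_len, drop, result) are illustrative and not part of the original answer; the DataFrame is rebuilt from the question's example column.

```python
import numpy as np
import pandas as pd

# Example column from the question
df = pd.DataFrame({"a": [36.45, 35.45, np.nan, np.nan, np.nan, 37.21, 35.63,
                         36.45, 34.65, 31.45, np.nan, np.nan, 36.71, 35.55,
                         np.nan, np.nan, np.nan, np.nan, 37.71]})
k = 3

# True where the value is missing
is_nan = df["a"].isna()

# cumsum trick: the counter increases whenever the NaN flag changes,
# so every run of consecutive NaNs (or non-NaNs) gets its own label
run_id = is_nan.ne(is_nan.shift()).cumsum()

# Length of the run that each row belongs to
run_len = is_nan.groupby(run_id).transform("size")

# Rows to drop: NaN rows that sit in a run of length <= k
drop = is_nan & run_len.le(k)

# Keep everything else and renumber the index
result = df[~drop].reset_index(drop=True)
print(result)
```

With k = 3, the runs of three and two NaNs are removed and the run of four is kept, which should match the expected output in the question.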
You can create an indicator column to count the consecutive NaNs.
k = 3
(
df.groupby(pd.notna(df.a).cumsum())
.apply(lambda x: x.dropna() if pd.isna(x.a).sum() <= k else x)
.reset_index(drop=True)
)
Out[375]:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71
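The same grouping idea can also be written without apply. This is only a sketch, not the answer's own method: it swaps apply for transform, and the helper name drop_short_nan_runs is hypothetical. It assumes the same example column as in the question. Unlike x.dropna(), it looks only at the chosen column, so NaNs in other columns would not cause rows to be dropped.

```python
import numpy as np
import pandas as pd


def drop_short_nan_runs(df, col, k):
    """Hypothetical helper: drop NaN runs in `col` of length <= k."""
    # Each non-NaN value starts a new group; the NaNs that follow it
    # share its label (same grouping key as in the answer above)
    grp = df[col].notna().cumsum()
    # Number of NaNs in the group each row belongs to
    nan_count = df[col].isna().groupby(grp).transform("sum")
    # Keep non-NaN rows, and NaN rows whose run is longer than k
    keep = df[col].notna() | nan_count.gt(k)
    return df[keep].reset_index(drop=True)


# Example column from the question
df = pd.DataFrame({"a": [36.45, 35.45, np.nan, np.nan, np.nan, 37.21, 35.63,
                         36.45, 34.65, 31.45, np.nan, np.nan, 36.71, 35.55,
                         np.nan, np.nan, np.nan, np.nan, 37.71]})

out = drop_short_nan_runs(df, "a", k=3)
print(out)
```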