I want to fill missing values in my pandas Series, but only where there are fewer than 3 consecutive NaNs.
Original series with missing values:
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(20))
s[[1, 3, 5, 7, 12, 13, 14, 15, 18]] = np.nan
Gives:
0 0.444025
1 NaN
2 0.631753
3 NaN
4 -0.577121
5 NaN
6 1.299953
7 NaN
8 -0.252173
9 0.287641
10 0.941953
11 -1.624728
12 NaN
13 NaN
14 NaN
15 NaN
16 0.998952
17 0.195698
18 NaN
19 -0.788995
BUT, using pandas fillna() with a limit only fills the specified number of values overall, not runs of up to that many CONSECUTIVE NaNs as I expected:
s.fillna(value=0, limit=3)  # fails to fill values at position 7 and onward
The desired output would fill the NaNs with 0 at positions 1, 3, 5, 7, and 18, and leave the run of 4 NaNs at positions 12-15 in place.
The documentation and other posts on SO have not resolved this issue (e.g. here). The documentation seems to imply that the limit applies to consecutive NaNs, not to the overall number filled across the entire dataset. Thanks!
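For comparison, here is that limit behaviour on a small deterministic series (a sketch of my own, not part of the original question):
# Illustration only: with a scalar value and no method, `limit` caps the
# total number of NaNs filled, counting from the top of the Series.
demo = pd.Series([1.0, np.nan, 2.0, np.nan, np.nan, np.nan, 3.0, np.nan])
demo.fillna(0, limit=3)
# fills the first three NaNs it meets (positions 1, 3 and 4);
# positions 5 and 7 stay NaN, regardless of how long each run is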
We start by finding where the non-null values are via pd.Series.notna. When we take the cumsum of that mask, the cumulative sum increments at every non-null value, which generates convenient group labels for the contiguous NaN values that follow. However, every group except possibly the first begins with a non-null value, so I take the negation of mask and sum the total number of null values within each group.
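To make the grouping concrete, here is what those intermediate pieces look like for the example series s (a sketch I am adding for illustration; the answer's code below builds the same mask):
mask = s.notna()                                    # True where a value is present
groups = mask.cumsum()                              # a non-null value and the NaNs after it share one label
na_per_group = (~mask).groupby(groups).transform('sum')
# With the NaN positions from the question, positions 11-15 all get 4
# (the run at 12-15 holds four NaNs); the isolated NaNs get 1.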
Now I fillna and use pd.Series.where to mask out the spots where the count of NaN values in the run was too high.
mask = s.notna()
c_na = (~mask).groupby(mask.cumsum()).transform('sum')  # NaNs per run, broadcast to every row of that run
filled = s.fillna(0).where(c_na.le(3))                  # fill, then re-NaN rows whose run held more than 3 NaNs
s.fillna(filled)                                         # original values stay; NaNs take 0 for short runs, stay NaN otherwise
0 1.418895
1 0.000000
2 -0.553732
3 0.000000
4 -0.101532
5 0.000000
6 -1.334803
7 0.000000
8 1.159115
9 0.309093
10 -0.047970
11 0.051567
12 NaN
13 NaN
14 NaN
15 NaN
16 0.623673
17 -0.786857
18 0.000000
19 0.310688
dtype: float64
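If you need this more than once, the same logic packs into a small helper; this is my own sketch (the function name and defaults are mine, not from the answer above):
def fill_short_nan_runs(s, value=0, max_run=3):
    """Fill NaNs with `value`, but only inside runs of at most `max_run` consecutive NaNs."""
    mask = s.notna()
    run_size = (~mask).groupby(mask.cumsum()).transform('sum')  # NaNs in the run each row belongs to
    return s.fillna(value).where(mask | run_size.le(max_run))   # keep originals, fill short runs, NaN long ones

fill_short_nan_runs(s)   # fills positions 1, 3, 5, 7 and 18; leaves 12-15 as NaN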
Here is a fancy NumPy/pandas way using np.bincount and pd.factorize:
v = s.values
m = np.isnan(v)                        # True where NaN
f, u = pd.factorize((~m).cumsum())     # group code for each value and the NaN run that follows it
filled = np.where(
    ~m, v,                             # keep the original values where present
    np.where(np.bincount(f, weights=m)[f] <= 3, 0, np.nan)  # 0 for short NaN runs, NaN for long ones
)
pd.Series(filled, s.index)
0 1.418895
1 0.000000
2 -0.553732
3 0.000000
4 -0.101532
5 0.000000
6 -1.334803
7 0.000000
8 1.159115
9 0.309093
10 -0.047970
11 0.051567
12 NaN
13 NaN
14 NaN
15 NaN
16 0.623673
17 -0.786857
18 0.000000
19 0.310688
dtype: float64
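In case the bincount/factorize step is opaque, here is a tiny sketch of my own showing how it counts the NaNs in each group:
m = np.array([False, True, True, False, True])  # True where NaN
f, u = pd.factorize((~m).cumsum())              # group codes: [0, 0, 0, 1, 1]
np.bincount(f, weights=m)                       # NaNs per group: [2., 1.]
np.bincount(f, weights=m)[f]                    # broadcast back:  [2., 2., 2., 1., 1.]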
Maybe try this?
t = s[s.isnull()]                                                       # the NaN entries
v = pd.Series(t.index, index=t.index).diff().ne(1).cumsum()            # label each run of consecutive NaN positions
z = v[v.isin(v.value_counts()[v.value_counts().gt(3)].index.values)]   # positions that sit in runs longer than 3
s.fillna(0).mask(s.index.isin(z.index))                                 # fill everything, then re-mask the long runs
0 -0.781728
1 0.000000
2 -1.114552
3 0.000000
4 1.242452
5 0.000000
6 0.599486
7 0.000000
8 0.757384
9 -1.559661
10 0.527451
11 -0.426890
12 NaN
13 NaN
14 NaN
15 NaN
16 -1.264962
17 0.703790
18 0.000000
19 0.953616
dtype: float64
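As a footnote of my own: the diff().ne(1).cumsum() idiom above labels each run of consecutive index positions, which is what groups the NaN runs:
pos = pd.Series([1, 3, 5, 7, 12, 13, 14, 15, 18])  # index positions of the NaNs
pos.diff().ne(1).cumsum()
# -> 1, 2, 3, 4, 5, 5, 5, 5, 6 : consecutive positions share a label,
#    so the run at 12-15 becomes one group of size 4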