I have time series data in the following format, where a value indicates an accumulated amount since the past recording. What I want to do is "spread" that accumulated amount over the past periods containing NaN so that this input:
s = pd.Series([0, 0, np.nan, np.nan, 75, np.nan, np.nan, np.nan, np.nan, 50],
pd.date_range(start="Jan 1 2016", end="Jan 10 2016", freq='D'))
2016-01-01 0.0
2016-01-02 0.0
2016-01-03 NaN
2016-01-04 NaN
2016-01-05 75.0
2016-01-06 NaN
2016-01-07 NaN
2016-01-08 NaN
2016-01-09 NaN
2016-01-10 50.0
Becomes this output:
2016-01-01 0.0
2016-01-02 0.0
2016-01-03 25.0
2016-01-04 25.0
2016-01-05 25.0
2016-01-06 10.0
2016-01-07 10.0
2016-01-08 10.0
2016-01-09 10.0
2016-01-10 10.0
Is there an idiomatic Pandas way to do this rather than just do a for loop over the data? I've tried a variety of things involving fillna
, dropna
, isnull
, doing shift
to check the next value, etc but I can't see how to put the pieces together.
This might work, for each chunk of missing values, create a group variable with cumsum
(from the end of the series) and then perform a grouped average operation on each chunk:
s.groupby(s.notnull()[::-1].cumsum()[::-1]).transform(lambda g: g[-1]/g.size)
#2016-01-01 0.0
#2016-01-02 0.0
#2016-01-03 25.0
#2016-01-04 25.0
#2016-01-05 25.0
#2016-01-06 10.0
#2016-01-07 10.0
#2016-01-08 10.0
#2016-01-09 10.0
#2016-01-10 10.0
#Freq: D, dtype: float64
Or another option:
s.groupby(s.shift().notnull().cumsum()).transform(lambda g: g[-1]/g.size)
#2016-01-01 0.0
#2016-01-02 0.0
#2016-01-03 25.0
#2016-01-04 25.0
#2016-01-05 25.0
#2016-01-06 10.0
#2016-01-07 10.0
#2016-01-08 10.0
#2016-01-09 10.0
#2016-01-10 10.0
#Freq: D, dtype: float64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With