I have a simple pandas
series:
import pandas as pd
quantities = [1, 14, 14, 11, 12, 13, 14]
timestamps = [pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 2), pd.Timestamp(2015, 4, 3), pd.Timestamp(2015, 4, 4), pd.Timestamp(2015, 4, 5), pd.Timestamp(2015, 4, 8)]
series = pd.Series(quantities, index=timestamps)
which looks as follows:
2015-04-01 1
2015-04-01 14
2015-04-02 14
2015-04-03 11
2015-04-04 12
2015-04-05 13
2015-04-08 14
dtype: int64
I would like to fill the missing dates i.e. 2015-04-06 = NaN
and 2015-04-07 = NaN
but keep the series as is, i.e.:
2015-04-01 1
2015-04-01 14
2015-04-02 14
2015-04-03 11
2015-04-04 12
2015-04-05 13
2015-04-06 NaN
2015-04-07 NaN
2015-04-08 14
dtype: int64
I've tried:
series = series.asfreq('D')
but get the following error: ValueError: cannot reindex from a duplicate axis. This error happens because of the duplicate timestamp values.
Is there any way on Earth to achieve this?
Thanks for any help.
To add missing dates to Python Pandas DataFrame, we can use the DatetimeIndex instance's reindex method. We create a date range index with idx = pd. date_range('09-01-2020', '09-30-2020') .
Let's try:
s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()
Output:
2015-04-01 1.0
2015-04-01 14.0
2015-04-02 14.0
2015-04-03 11.0
2015-04-04 12.0
2015-04-05 13.0
2015-04-06 NaN
2015-04-07 NaN
2015-04-08 14.0
dtype: float64
Timings:
%%timeit
temp = series[~series.index.duplicated(keep='first')].asfreq('D')
pd.concat([series, temp.loc[~temp.index.isin(series.index)]]).sort_index()
2.51 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated. Use
pandas.date_range
instead.2.07 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()
1.86 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Thanks @root for this suggestion.
%%timeit
s = pd.Series(index=pd.date_range(series.index.min(), series.index.max(), freq='D')\
.difference(series.index))
pd.concat([series,s]).sort_index()
1.55 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This should be enough, assuming you don't have millions of rows:
series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)
Output:
x
2015-04-01 1.0
2015-04-01 14.0
2015-04-02 14.0
2015-04-03 11.0
2015-04-04 12.0
2015-04-05 13.0
2015-04-06 NaN
2015-04-07 NaN
2015-04-08 14.0
If you want a series you can access the column x of the resulting DataFrame: calendar.join(series).x
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With