I have a simple pandas series:
import pandas as pd
quantities = [1, 14, 14, 11, 12, 13, 14]
timestamps = [pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 2), pd.Timestamp(2015, 4, 3), pd.Timestamp(2015, 4, 4), pd.Timestamp(2015, 4, 5), pd.Timestamp(2015, 4, 8)]
series = pd.Series(quantities, index=timestamps)
which looks as follows:
2015-04-01     1
2015-04-01    14
2015-04-02    14
2015-04-03    11
2015-04-04    12
2015-04-05    13
2015-04-08    14
dtype: int64
I would like to fill the missing dates i.e. 2015-04-06 = NaN and 2015-04-07 = NaN but keep the series as is, i.e.:
2015-04-01     1
2015-04-01    14
2015-04-02    14
2015-04-03    11
2015-04-04    12
2015-04-05    13
2015-04-06    NaN
2015-04-07    NaN
2015-04-08    14
dtype: int64
I've tried:
series = series.asfreq('D')
but get the following error: ValueError: cannot reindex from a duplicate axis. This error happens because of the duplicate timestamp values.
Is there any way on Earth to achieve this?
Thanks for any help.
To add missing dates to Python Pandas DataFrame, we can use the DatetimeIndex instance's reindex method. We create a date range index with idx = pd. date_range('09-01-2020', '09-30-2020') .
Let's try:
s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()
Output:
2015-04-01     1.0
2015-04-01    14.0
2015-04-02    14.0
2015-04-03    11.0
2015-04-04    12.0
2015-04-05    13.0
2015-04-06     NaN
2015-04-07     NaN
2015-04-08    14.0
dtype: float64
Timings:
%%timeit
temp = series[~series.index.duplicated(keep='first')].asfreq('D')
pd.concat([series, temp.loc[~temp.index.isin(series.index)]]).sort_index()
2.51 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated. Use
pandas.date_rangeinstead.2.07 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()
1.86 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Thanks @root for this suggestion.
%%timeit
s = pd.Series(index=pd.date_range(series.index.min(), series.index.max(), freq='D')\
                      .difference(series.index))
pd.concat([series,s]).sort_index()
1.55 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This should be enough, assuming you don't have millions of rows:
series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)
Output:
               x
2015-04-01   1.0
2015-04-01  14.0
2015-04-02  14.0
2015-04-03  11.0
2015-04-04  12.0
2015-04-05  13.0
2015-04-06   NaN
2015-04-07   NaN
2015-04-08  14.0
If you want a series you can access the column x of the resulting DataFrame: calendar.join(series).x
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With