Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas: Fill missing dates when keeping duplicates

Tags:

python

pandas

I have a simple pandas series:

import pandas as pd

quantities = [1, 14, 14, 11, 12, 13, 14]
timestamps = [pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 2), pd.Timestamp(2015, 4, 3), pd.Timestamp(2015, 4, 4), pd.Timestamp(2015, 4, 5), pd.Timestamp(2015, 4, 8)]
series = pd.Series(quantities, index=timestamps)

which looks as follows:

2015-04-01     1
2015-04-01    14
2015-04-02    14
2015-04-03    11
2015-04-04    12
2015-04-05    13
2015-04-08    14
dtype: int64

I would like to fill the missing dates i.e. 2015-04-06 = NaN and 2015-04-07 = NaN but keep the series as is, i.e.:

2015-04-01     1
2015-04-01    14
2015-04-02    14
2015-04-03    11
2015-04-04    12
2015-04-05    13
2015-04-06    NaN
2015-04-07    NaN
2015-04-08    14
dtype: int64

I've tried:

series = series.asfreq('D')

but get the following error: ValueError: cannot reindex from a duplicate axis. This error happens because of the duplicate timestamp values.

Is there any way on Earth to achieve this?

Thanks for any help.

like image 968
alexjrlewis Avatar asked Apr 02 '19 17:04

alexjrlewis


People also ask

How do you fill in missing dates in an irregular series in Python?

To add missing dates to Python Pandas DataFrame, we can use the DatetimeIndex instance's reindex method. We create a date range index with idx = pd. date_range('09-01-2020', '09-30-2020') .


2 Answers

Let's try:

s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()

Output:

2015-04-01     1.0
2015-04-01    14.0
2015-04-02    14.0
2015-04-03    11.0
2015-04-04    12.0
2015-04-05    13.0
2015-04-06     NaN
2015-04-07     NaN
2015-04-08    14.0
dtype: float64

Timings:

%%timeit
temp = series[~series.index.duplicated(keep='first')].asfreq('D')
pd.concat([series, temp.loc[~temp.index.isin(series.index)]]).sort_index()

2.51 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated. Use pandas.date_range instead.

2.07 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()

1.86 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Thanks @root for this suggestion.

%%timeit
s = pd.Series(index=pd.date_range(series.index.min(), series.index.max(), freq='D')\
                      .difference(series.index))
pd.concat([series,s]).sort_index()

1.55 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

like image 134
Scott Boston Avatar answered Oct 07 '22 01:10

Scott Boston


This should be enough, assuming you don't have millions of rows:

series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)

Output:

               x
2015-04-01   1.0
2015-04-01  14.0
2015-04-02  14.0
2015-04-03  11.0
2015-04-04  12.0
2015-04-05  13.0
2015-04-06   NaN
2015-04-07   NaN
2015-04-08  14.0

If you want a series you can access the column x of the resulting DataFrame: calendar.join(series).x

like image 1
arinarmo Avatar answered Oct 07 '22 00:10

arinarmo