I have a simple <code>pandas</code> series: <pre class="prettyprint"><code>import pandas as pd quantities = [1, 14, 14, 11, 12, 13, 14] timestamps = [pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 2), pd.Timestamp(2015, 4, 3), pd.Timestamp(2015, 4, 4), pd.Timestamp(2015, 4, 5), pd.Timestamp(2015, 4, 8)] series = pd.Series(quantities, index=timestamps) </code></pre> which looks as follows: <pre class="prettyprint"><code>2015-04-01 1 2015-04-01 14 2015-04-02 14 2015-04-03 11 2015-04-04 12 2015-04-05 13 2015-04-08 14 dtype: int64 </code></pre> I would like to fill the missing dates i.e. <code>2015-04-06 = NaN</code> and <code>2015-04-07 = NaN</code> but keep the series as is, i.e.: <pre class="prettyprint"><code>2015-04-01 1 2015-04-01 14 2015-04-02 14 2015-04-03 11 2015-04-04 12 2015-04-05 13 2015-04-06 NaN 2015-04-07 NaN 2015-04-08 14 dtype: int64 </code></pre> I've tried: <pre class="prettyprint"><code>series = series.asfreq('D') </code></pre> but get the following error: ValueError: cannot reindex from a duplicate axis. This error happens because of the duplicate timestamp values. Is there any way on Earth to achieve this? Thanks for any help.

Let's try: <pre class="prettyprint"><code>s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D')) pd.concat([series,s[~s.index.isin(series.index)]]).sort_index() </code></pre> Output: <pre class="prettyprint"><code>2015-04-01 1.0 2015-04-01 14.0 2015-04-02 14.0 2015-04-03 11.0 2015-04-04 12.0 2015-04-05 13.0 2015-04-06 NaN 2015-04-07 NaN 2015-04-08 14.0 dtype: float64 </code></pre> Timings: <pre class="prettyprint"><code>%%timeit temp = series[~series.index.duplicated(keep='first')].asfreq('D') pd.concat([series, temp.loc[~temp.index.isin(series.index)]]).sort_index() </code></pre> <blockquote> 2.51 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) </blockquote> <pre class="prettyprint"><code>%%timeit series.name = "x" calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D')) calendar.join(series) </code></pre> <blockquote> C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated. Use <code>pandas.date_range</code> instead. 2.07 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) </blockquote> <pre class="prettyprint"><code>%%timeit s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D')) pd.concat([series,s[~s.index.isin(series.index)]]).sort_index() </code></pre> <blockquote> 1.86 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) </blockquote> Thanks @root for this suggestion. <pre class="prettyprint"><code>%%timeit s = pd.Series(index=pd.date_range(series.index.min(), series.index.max(), freq='D')\ .difference(series.index)) pd.concat([series,s]).sort_index() </code></pre> <blockquote> 1.55 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) </blockquote>

This should be enough, assuming you don't have millions of rows: <pre class="prettyprint"><code>series.name = "x" calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D')) calendar.join(series) </code></pre> Output: <pre class="prettyprint"><code> x 2015-04-01 1.0 2015-04-01 14.0 2015-04-02 14.0 2015-04-03 11.0 2015-04-04 12.0 2015-04-05 13.0 2015-04-06 NaN 2015-04-07 NaN 2015-04-08 14.0 </code></pre> If you want a series you can access the column x of the resulting DataFrame: <code>calendar.join(series).x</code>

pandas: Fill missing dates when keeping duplicates

I have a simple pandas series:

import pandas as pd

quantities = [1, 14, 14, 11, 12, 13, 14]
timestamps = [pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 2), pd.Timestamp(2015, 4, 3), pd.Timestamp(2015, 4, 4), pd.Timestamp(2015, 4, 5), pd.Timestamp(2015, 4, 8)]
series = pd.Series(quantities, index=timestamps)

which looks as follows:

2015-04-01     1
2015-04-01    14
2015-04-02    14
2015-04-03    11
2015-04-04    12
2015-04-05    13
2015-04-08    14
dtype: int64

I would like to fill the missing dates i.e. 2015-04-06 = NaN and 2015-04-07 = NaN but keep the series as is, i.e.:

2015-04-01     1
2015-04-01    14
2015-04-02    14
2015-04-03    11
2015-04-04    12
2015-04-05    13
2015-04-06    NaN
2015-04-07    NaN
2015-04-08    14
dtype: int64

I've tried:

series = series.asfreq('D')

but get the following error: ValueError: cannot reindex from a duplicate axis. This error happens because of the duplicate timestamp values.

Is there any way on Earth to achieve this?

Thanks for any help.

How do you fill in missing dates in an irregular series in Python?

To add missing dates to Python Pandas DataFrame, we can use the DatetimeIndex instance's reindex method. We create a date range index with idx = pd. date_range('09-01-2020', '09-30-2020') .

Let's try:

s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()

Output:

2015-04-01     1.0
2015-04-01    14.0
2015-04-02    14.0
2015-04-03    11.0
2015-04-04    12.0
2015-04-05    13.0
2015-04-06     NaN
2015-04-07     NaN
2015-04-08    14.0
dtype: float64

Timings:

%%timeit
temp = series[~series.index.duplicated(keep='first')].asfreq('D')
pd.concat([series, temp.loc[~temp.index.isin(series.index)]]).sort_index()

2.51 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated. Use pandas.date_range instead.

2.07 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()

1.86 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Thanks @root for this suggestion.

%%timeit
s = pd.Series(index=pd.date_range(series.index.min(), series.index.max(), freq='D')\
                      .difference(series.index))
pd.concat([series,s]).sort_index()

1.55 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This should be enough, assuming you don't have millions of rows:

series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)

Output:

               x
2015-04-01   1.0
2015-04-01  14.0
2015-04-02  14.0
2015-04-03  11.0
2015-04-04  12.0
2015-04-05  13.0
2015-04-06   NaN
2015-04-07   NaN
2015-04-08  14.0

If you want a series you can access the column x of the resulting DataFrame: calendar.join(series).x

pandas: Fill missing dates when keeping duplicates

Tags:

python

pandas

alexjrlewis

People also ask

2 Answers

Scott Boston

arinarmo

Recent Activity

Donate For Us

pandas: Fill missing dates when keeping duplicates

Tags:

python

pandas

alexjrlewis

People also ask

2 Answers

Scott Boston

arinarmo

Related questions

Recent Activity

Donate For Us