Here is a sample of the input pandas dataframe:
**LastUpdate** **Whatever** ...
2017-12-30 xxx ...
2017-12-30 yyy ...
2017-12-30 zzz ...
2018-01-01 yyy ...
2018-01-03 zzz ...
Here is the expected DF (output):
**LastUpdate** **Whatever** ...
2017-12-30 xxx ...
2017-12-30 yyy ...
2017-12-30 zzz ...
2017-12-31 xxx ...
2017-12-31 yyy ...
2017-12-31 zzz ...
2018-01-01 yyy ...
2018-01-02 yyy ...
2018-01-03 zzz ...
As you can see, each missing day should simply duplicate the previous day's rows, so I'm filling every missing day with (all of) the previous day's data. The catch is that the number of rows per day can differ, which is what makes this tricky.
Important note: more than one day may be missing between two dates (for example, the data could jump from 2018-01-01 to 2018-01-05), in which case every missing day in between needs to be added with exactly the same data (same number of rows, same content) as 2018-01-01, the last day with data available.
I've done some research and came across the resample, ffill and reset_index methods, but they don't seem to fit my case because they require a unique date index, which I don't have here since one day may have several rows.
What I've tried so far:
df['Last Update'] = pd.to_datetime(df['Last Update'])
df.set_index("Last Update", inplace=True)
dfResult = df.resample('D').ffill().reset_index()
which yields `cannot reindex a non-unique index with a method or limit`
(and that totally makes sense), but I really can't figure out a way to achieve what I'm trying to do.
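For reference, here is a minimal snippet that reproduces the error, built from the sample data above (the exact message wording may vary between pandas versions):
import pandas as pd

d = pd.DataFrame({'Last Update': pd.to_datetime(['2017-12-30', '2017-12-30', '2018-01-01']),
                  'Whatever': ['xxx', 'yyy', 'yyy']})
try:
    # ffill after resample reindexes on the dates, which fails because
    # the DatetimeIndex contains duplicate labels
    d.set_index('Last Update').resample('D').ffill()
except ValueError as e:
    print(e)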
Let me know if anything is unclear or if you need any additional information; any help would be appreciated.
# This solution should also work for multiple columns.
# Setup.
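# A hedged reconstruction of the frame this answer starts from, inferred
# from the printed output below (the answer assumes `df` already exists):
import pandas as pd
import numpy as np  # used later for np.nan

df = pd.DataFrame({'LastUpdate': ['2017-12-30', '2017-12-30', '2017-12-30', '2018-01-01',
                                  '2018-01-05', '2018-01-06', '2018-01-06', '2018-01-09'],
                   'Whatever': ['xxx', 'yyy', 'zzz', 'yyy', 'zzz', 'xxx', 'xxx', 'yyy']})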
df['Whatever2'] = df['Whatever'].map({'xxx':'a', 'yyy':'b', 'zzz':'c'})
df
LastUpdate Whatever Whatever2
0 2017-12-30 xxx a
1 2017-12-30 yyy b
2 2017-12-30 zzz c
3 2018-01-01 yyy b
4 2018-01-05 zzz c
5 2018-01-06 xxx a
6 2018-01-06 xxx a
7 2018-01-09 yyy b
Use `set_index` + `unstack`, then `reindex` and `stack` again.
# If required, convert "LastUpdate" to `datetime`.
# df['LastUpdate'] = pd.to_datetime(df['LastUpdate'], errors='coerce')
(df.set_index(['LastUpdate', df.groupby('LastUpdate').cumcount()])
.unstack(1, fill_value='')
.reindex(pd.date_range(df['LastUpdate'].min(), df['LastUpdate'].max()))
.ffill()
.replace('', np.nan)
.stack(1)
.reset_index(level=1, drop=True)
.rename_axis('LastUpdate').reset_index())
LastUpdate Whatever Whatever2
0 2017-12-30 xxx a
1 2017-12-30 yyy b
2 2017-12-30 zzz c
3 2017-12-31 xxx a
4 2017-12-31 yyy b
5 2017-12-31 zzz c
6 2018-01-01 yyy b
7 2018-01-02 yyy b
8 2018-01-03 yyy b
9 2018-01-04 yyy b
10 2018-01-05 zzz c
11 2018-01-06 xxx a
12 2018-01-06 xxx a
13 2018-01-07 xxx a
14 2018-01-07 xxx a
15 2018-01-08 xxx a
16 2018-01-08 xxx a
17 2018-01-09 yyy b
First, set the index. Use `cumcount` to get a count of the repeating dates; this is required to determine how many times the new dates must be repeated.
df.groupby('LastUpdate').cumcount().to_numpy()
# array([0, 1, 2, 0, 0, 0, 1, 0])
df.set_index(['LastUpdate', df.groupby('LastUpdate').cumcount()])
Whatever Whatever2
LastUpdate
2017-12-30 0 xxx a
1 yyy b
2 zzz c
2018-01-01 0 yyy b
2018-01-05 0 zzz c
2018-01-06 0 xxx a
1 xxx a
2018-01-09 0 yyy b
Next, use `unstack`. I use `fill_value=''` to act as a blocker for the coming forward-fill step, so that a date with fewer rows does not inherit extra rows from earlier dates.
_.unstack(1, fill_value='')
Whatever Whatever2
0 1 2 0 1 2
LastUpdate
2017-12-30 xxx yyy zzz a b c
2018-01-01 yyy b
2018-01-05 zzz c
2018-01-06 xxx xxx a a
2018-01-09 yyy b
You can now use `reindex` to include the missing dates:
_.reindex(pd.date_range(df['LastUpdate'].min(), df['LastUpdate'].max()))
Whatever Whatever2
0 1 2 0 1 2
2017-12-30 xxx yyy zzz a b c
2017-12-31 NaN NaN NaN NaN NaN NaN
2018-01-01 yyy b
2018-01-02 NaN NaN NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN NaN NaN
2018-01-05 zzz c
2018-01-06 xxx xxx a a
2018-01-07 NaN NaN NaN NaN NaN NaN
2018-01-08 NaN NaN NaN NaN NaN NaN
2018-01-09 yyy b
Now, forward-fill so that the i-th row of the most recent available day is assigned to the corresponding position of each missing date.
_.ffill()
Whatever Whatever2
0 1 2 0 1 2
2017-12-30 xxx yyy zzz a b c
2017-12-31 xxx yyy zzz a b c
2018-01-01 yyy b
2018-01-02 yyy b
2018-01-03 yyy b
2018-01-04 yyy b
2018-01-05 zzz c
2018-01-06 xxx xxx a a
2018-01-07 xxx xxx a a
2018-01-08 xxx xxx a a
2018-01-09 yyy b
Replace the filler values with NaN, and `stack`.
_.replace('', np.nan).stack(1)
Whatever Whatever2
2017-12-30 0 xxx a
1 yyy b
2 zzz c
2017-12-31 0 xxx a
1 yyy b
2 zzz c
2018-01-01 0 yyy b
2018-01-02 0 yyy b
2018-01-03 0 yyy b
2018-01-04 0 yyy b
2018-01-05 0 zzz c
2018-01-06 0 xxx a
1 xxx a
2018-01-07 0 xxx a
1 xxx a
2018-01-08 0 xxx a
1 xxx a
2018-01-09 0 yyy b
After that, it's just a matter of cleaning up the index, as shown below.
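For completeness, these are the final steps from the full chain above, continuing from the previous result: drop the helper level, restore the `LastUpdate` name, and turn the index back into a column.
_.reset_index(level=1, drop=True).rename_axis('LastUpdate').reset_index()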
Here's how I did it. I'll use a slightly more complex example, which I extended from your sample input, in order to demonstrate that my approach satisfies all requirements:
df = pd.DataFrame(columns=['LastUpdate', 'Whatever', 'Column2'],
                  data=[['2017-12-30', 'xxx', 'a'],
                        ['2017-12-30', 'yyy', 'b'],
                        ['2017-12-30', 'zzz', 'c'],
                        ['2018-01-01', 'yyy', 'b'],
                        ['2018-01-05', 'zzz', 'c'],
                        ['2018-01-06', 'xxx', 'a'],
                        ['2018-01-06', 'xxx', 'a'],
                        ['2018-01-09', 'yyy', 'b']])
df
LastUpdate Whatever Column2
0 2017-12-30 xxx a
1 2017-12-30 yyy b
2 2017-12-30 zzz c
3 2018-01-01 yyy b
4 2018-01-05 zzz c
5 2018-01-06 xxx a
6 2018-01-06 xxx a
7 2018-01-09 yyy b
First, make the `LastUpdate` column the df's index and convert the index to a DatetimeIndex:
df.set_index('LastUpdate', drop=True, inplace=True)
df.index = pd.to_datetime(df.index)
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
missing_dates = [i for i in all_days if i not in df.index]
new_dfs = []
most_recent = df.index[0]
for i in missing_dates:
    # If the day before this missing date has data, it becomes the new
    # "most recent" day whose rows will be copied forward.
    if i - pd.Timedelta(days=1) in df.index:
        most_recent = i - pd.Timedelta(days=1)
    to_insert = pd.DataFrame(df.loc[most_recent])
    if to_insert.shape[1] == 1:
        # df.loc returned a Series because the most recent non-missing date
        # had only one row; transpose so the row-index contains the date.
        to_insert = to_insert.T
    # Shift the copied rows' dates forward onto the missing date.
    shift_amt = i - most_recent
    to_insert = to_insert.shift(shift_amt.days, freq='D')
    new_dfs.append(to_insert)

# Splice each block of copied rows into df at the correct position.
for i in new_dfs:
    top_idx = pd.date_range(df.index.min(), i.shift(-1, freq='D').index.min(), freq='D')
    top = df.loc[top_idx]
    bottom_len = len(df.index) - len(top)
    bottom = df.iloc[-bottom_len:]
    df = pd.concat([top, i, bottom])
The resulting dataframe looks like this. All missing dates, whether single or consecutive, have been filled with rows identical to those of the most recent non-missing date:
df
Whatever Column2
2017-12-30 xxx a
2017-12-30 yyy b
2017-12-30 zzz c
2017-12-31 xxx a
2017-12-31 yyy b
2017-12-31 zzz c
2018-01-01 yyy b
2018-01-02 yyy b
2018-01-03 yyy b
2018-01-04 yyy b
2018-01-05 zzz c
2018-01-06 xxx a
2018-01-06 xxx a
2018-01-07 xxx a
2018-01-07 xxx a
2018-01-08 xxx a
2018-01-08 xxx a
2018-01-09 yyy b
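If you need `LastUpdate` back as a regular column, as in the question's expected output, a small final step (using `rename_axis` so it works whether or not the index kept its name through the concatenations):
df = df.rename_axis('LastUpdate').reset_index()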