
Duplicating the previous day's rows for all missing dates in a DataFrame

Here is a sample of the input pandas dataframe:

**LastUpdate**                         **Whatever**                 ...
2017-12-30                              xxx                          ...
2017-12-30                              yyy                          ...
2017-12-30                              zzz                          ...
2018-01-01                              yyy                          ...
2018-01-03                              zzz                          ...

Here is the expected DF (output):

**LastUpdate**                         **Whatever**                 ...
2017-12-30                              xxx                          ...
2017-12-30                              yyy                          ...
2017-12-30                              zzz                          ...
2017-12-31                              xxx                          ...
2017-12-31                              yyy                          ...
2017-12-31                              zzz                          ...
2018-01-01                              yyy                          ...
2018-01-02                              yyy                          ...
2018-01-03                              zzz                          ...

As you can see, each missing day is filled with a copy of all of the previous day's rows. The difficulty is that the number of rows per day can differ, so a simple row-by-row fill does not work.

Important note: more than one day may be missing between two dates. For example, the data could jump from 2018-01-01 to 2018-01-05, in which case every missing day in between must be added with the same data (the exact same number of rows and content) as 2018-01-01, the last day with data available.

I did some research and came across the resample, ffill and reset_index methods, but they don't seem to fit my case: they require a unique date index, and here one day may have several rows associated with it.

What I've tried so far:

df['Last Update'] = pd.to_datetime(df['Last Update'])
df.set_index("Last Update", inplace=True)
dfResult = df.resample('D').ffill().reset_index()

which raises "cannot reindex a non-unique index with a method or limit" (and that totally makes sense), but I really can't figure out a way to achieve what I'm trying to do. Let me know if anything is unclear or if you need more information. Any help would be appreciated.

Theo Babilon asked Jan 31 '19 00:01

2 Answers

Setup

# Add a second column to show that the solution also works for multiple columns.
df['Whatever2'] = df['Whatever'].map({'xxx':'a', 'yyy':'b', 'zzz':'c'})
df

  LastUpdate Whatever Whatever2
0 2017-12-30      xxx         a
1 2017-12-30      yyy         b
2 2017-12-30      zzz         c
3 2018-01-01      yyy         b
4 2018-01-05      zzz         c
5 2018-01-06      xxx         a
6 2018-01-06      xxx         a
7 2018-01-09      yyy         b

Solution

Use set_index + unstack, then reindex and stack again.

import numpy as np
import pandas as pd

# If required, convert "LastUpdate" to `datetime`.
# df['LastUpdate'] = pd.to_datetime(df['LastUpdate'], errors='coerce')

(df.set_index(['LastUpdate', df.groupby('LastUpdate').cumcount()])
   .unstack(1, fill_value='')
   .reindex(pd.date_range(df['LastUpdate'].min(), df['LastUpdate'].max()))
   .ffill()
   .replace('', np.nan)
   .stack(1)
   .reset_index(level=1, drop=True)
   .rename_axis('LastUpdate').reset_index())

   LastUpdate Whatever Whatever2
0  2017-12-30      xxx         a
1  2017-12-30      yyy         b
2  2017-12-30      zzz         c
3  2017-12-31      xxx         a
4  2017-12-31      yyy         b
5  2017-12-31      zzz         c
6  2018-01-01      yyy         b
7  2018-01-02      yyy         b
8  2018-01-03      yyy         b
9  2018-01-04      yyy         b
10 2018-01-05      zzz         c
11 2018-01-06      xxx         a
12 2018-01-06      xxx         a
13 2018-01-07      xxx         a
14 2018-01-07      xxx         a
15 2018-01-08      xxx         a
16 2018-01-08      xxx         a
17 2018-01-09      yyy         b

Details

First, set the index. Use cumcount to number the rows within each date (0, 1, 2, ...). This counter becomes the second index level and determines how many rows each filled-in date will receive.

df.groupby('LastUpdate').cumcount().to_numpy()
# array([0, 1, 2, 0, 0, 0, 1, 0])

df.set_index(['LastUpdate', df.groupby('LastUpdate').cumcount()])

             Whatever Whatever2
LastUpdate                     
2017-12-30 0      xxx         a
           1      yyy         b
           2      zzz         c
2018-01-01 0      yyy         b
2018-01-05 0      zzz         c
2018-01-06 0      xxx         a
           1      xxx         a
2018-01-09 0      yyy         b

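Next, use unstack. I pass fill_value='' so that positions a date does not have are marked with an empty string rather than NaN. The empty strings act as a block for a coming step (forward-filling): they stop values from an earlier, longer day from being carried into a later, shorter one.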

_.unstack(1, fill_value='')


           Whatever           Whatever2      
                  0    1    2         0  1  2
LastUpdate                                   
2017-12-30      xxx  yyy  zzz         a  b  c
2018-01-01      yyy                   b      
2018-01-05      zzz                   c      
2018-01-06      xxx  xxx              a  a   
2018-01-09      yyy                   b      
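To see why the empty-string sentinel matters, here is a minimal sketch on two hand-built wide frames. The names no_block and with_block are made up for illustration and the frames only mimic the unstacked layout above; they are not produced by the answer's code.

```python
import numpy as np
import pandas as pd

# Wide layout as after unstack + reindex: one column per within-day position.
idx = pd.to_datetime(['2017-12-30', '2017-12-31', '2018-01-01', '2018-01-02'])

# Hypothetical frame WITHOUT a sentinel: the unused position 1 on
# 2018-01-01 is NaN, just like the genuinely missing dates.
no_block = pd.DataFrame({0: ['xxx', np.nan, 'yyy', np.nan],
                         1: ['yyy', np.nan, np.nan, np.nan]}, index=idx)

# Hypothetical frame WITH the sentinel: the unused position holds ''.
with_block = pd.DataFrame({0: ['xxx', np.nan, 'yyy', np.nan],
                           1: ['yyy', np.nan, '', np.nan]}, index=idx)

leaked = no_block.ffill()     # 'yyy' from 2017-12-30 leaks into 2018-01-01
blocked = with_block.ffill()  # '' stops the fill at 2018-01-01

print(leaked)
print(blocked)
```

In the first frame the second row of 2017-12-30 is carried all the way down, so 2018-01-01 would wrongly gain an extra row. In the second frame the empty string is forward-filled instead and can later be turned back into NaN and dropped.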

You can now use reindex to include missing dates:

_.reindex(pd.date_range(df['LastUpdate'].min(), df['LastUpdate'].max()))

           Whatever           Whatever2          
                  0    1    2         0    1    2
2017-12-30      xxx  yyy  zzz         a    b    c
2017-12-31      NaN  NaN  NaN       NaN  NaN  NaN
2018-01-01      yyy                   b          
2018-01-02      NaN  NaN  NaN       NaN  NaN  NaN
2018-01-03      NaN  NaN  NaN       NaN  NaN  NaN
2018-01-04      NaN  NaN  NaN       NaN  NaN  NaN
2018-01-05      zzz                   c          
2018-01-06      xxx  xxx              a    a     
2018-01-07      NaN  NaN  NaN       NaN  NaN  NaN
2018-01-08      NaN  NaN  NaN       NaN  NaN  NaN
2018-01-09      yyy                   b          

Now forward-fill, so that the i-th row of the most recent day with data is copied into the i-th position of each missing date.

_.ffill()

           Whatever           Whatever2      
                  0    1    2         0  1  2
2017-12-30      xxx  yyy  zzz         a  b  c
2017-12-31      xxx  yyy  zzz         a  b  c
2018-01-01      yyy                   b      
2018-01-02      yyy                   b      
2018-01-03      yyy                   b      
2018-01-04      yyy                   b      
2018-01-05      zzz                   c      
2018-01-06      xxx  xxx              a  a   
2018-01-07      xxx  xxx              a  a   
2018-01-08      xxx  xxx              a  a   
2018-01-09      yyy                   b      

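Replace the filler values with NaN, and stack. Here stack drops the rows that are entirely NaN, which removes the placeholder positions. If your pandas version's stack no longer drops them by default, appending .dropna(how='all') after the stack should have the same effect.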

_.replace('', np.nan).stack(1)

             Whatever Whatever2
2017-12-30 0      xxx         a
           1      yyy         b
           2      zzz         c
2017-12-31 0      xxx         a
           1      yyy         b
           2      zzz         c
2018-01-01 0      yyy         b
2018-01-02 0      yyy         b
2018-01-03 0      yyy         b
2018-01-04 0      yyy         b
2018-01-05 0      zzz         c
2018-01-06 0      xxx         a
           1      xxx         a
2018-01-07 0      xxx         a
           1      xxx         a
2018-01-08 0      xxx         a
           1      xxx         a
2018-01-09 0      yyy         b

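After that, all that remains is cleaning up the index: drop the counter level, name the date index LastUpdate, and reset it back into a column.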
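The cleanup is the last three calls of the solution chain. As a small sketch, here it is applied to a hand-built frame that mimics the first rows of the stacked result (the variable names stacked and cleaned are only for illustration):

```python
import pandas as pd

# Hand-built stand-in for the stacked result: (date, counter) MultiIndex.
stacked = pd.DataFrame(
    {'Whatever': ['xxx', 'yyy', 'yyy'], 'Whatever2': ['a', 'b', 'b']},
    index=pd.MultiIndex.from_tuples([
        (pd.Timestamp('2017-12-30'), 0),
        (pd.Timestamp('2017-12-30'), 1),
        (pd.Timestamp('2017-12-31'), 0),
    ]),
)

cleaned = (stacked
           .reset_index(level=1, drop=True)  # drop the within-day counter
           .rename_axis('LastUpdate')        # name the remaining date index
           .reset_index())                   # turn the dates back into a column

print(cleaned)
```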

cs95 answered Nov 18 '22 19:11


Here's how I did it. I'll use a slightly more complex example, which I extended from your sample input, in order to demonstrate that my approach satisfies all requirements:

  • missing days in the data simply duplicate a previous day's row(s)
  • all consecutive missing days are filled with all the row(s) belonging to the most recent non-missing day
  • supports multiple columns
df = pd.DataFrame(columns = ['LastUpdate', 'Whatever', 'Column2'],
                  data = [['2017-12-30', 'xxx', 'a'],
                          ['2017-12-30', 'yyy', 'b'],                        
                          ['2017-12-30', 'zzz', 'c'],                        
                          ['2018-01-01', 'yyy', 'b'],                          
                          ['2018-01-05', 'zzz', 'c'],
                          ['2018-01-06', 'xxx', 'a'],
                          ['2018-01-06', 'xxx', 'a'],
                          ['2018-01-09', 'yyy', 'b']])

df
    LastUpdate   Whatever   Column2
0   2017-12-30   xxx        a
1   2017-12-30   yyy        b
2   2017-12-30   zzz        c
3   2018-01-01   yyy        b
4   2018-01-05   zzz        c
5   2018-01-06   xxx        a
6   2018-01-06   xxx        a
7   2018-01-09   yyy        b
  1. Set the LastUpdate column as the df's index and convert the index to a DatetimeIndex:
df.set_index('LastUpdate', drop=True, inplace=True)
df.index = pd.to_datetime(df.index)
  2. Create a date range that includes all dates (both present and missing) between the min and max of the original df's index:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')

  3. Create a list of timestamps representing the dates missing from the original df's index:
missing_dates = [i for i in all_days if i not in df.index]

  4. Create a new dataframe for each missing date and collect them in a list. Some of these will have multiple rows, and others will have a single row. Each dataframe will be indexed at a given missing date:
new_dfs = []
most_recent = df.index[0]
one_day = pd.Timedelta(days=1)  # integer arithmetic on Timestamps (i-1) is not supported in recent pandas versions
for i in missing_dates:
    if i - one_day in df.index:
        most_recent = i - one_day
    to_insert = pd.DataFrame(df.loc[most_recent])
    if to_insert.shape[1] == 1: # Ensure new df's row-index contains the date if most recent non-missing date had only one row 
        to_insert = to_insert.T
    shift_amt = i - most_recent
    to_insert = to_insert.shift(shift_amt.days, freq='D')
    new_dfs.append(to_insert)
  5. Final step. For each new dataframe to be inserted, we separate our original df into top and bottom halves, and use pd.concat to combine the top half, the new dataframe for a missing date, and the bottom half:
for i in new_dfs:
    top_idx = pd.date_range(df.index.min(), i.shift(-1, freq='D').index.min(), freq='D')
    top = df.loc[top_idx]
    bottom_len = len(df.index) - len(top)
    bottom = df.iloc[-bottom_len:]
    df = pd.concat([top, i, bottom])

The resulting dataframe looks like this. All missing dates, both single and consecutive, have been filled with row(s) identical to that/those belonging to the most recent non-missing date:

df

            Whatever   Column2
2017-12-30  xxx        a
2017-12-30  yyy        b
2017-12-30  zzz        c
2017-12-31  xxx        a
2017-12-31  yyy        b
2017-12-31  zzz        c
2018-01-01  yyy        b
2018-01-02  yyy        b
2018-01-03  yyy        b
2018-01-04  yyy        b
2018-01-05  zzz        c
2018-01-06  xxx        a
2018-01-06  xxx        a
2018-01-07  xxx        a
2018-01-07  xxx        a
2018-01-08  xxx        a
2018-01-08  xxx        a
2018-01-09  yyy        b
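For comparison, the same result can probably also be reached without explicit loops. The following is only a sketch of a different technique (a lookup table plus a left merge), not the method described in the steps above. The names unique_dates, source, lookup and filled are made up for illustration, and it assumes LastUpdate is a regular column rather than the index.

```python
import pandas as pd

df = pd.DataFrame(columns=['LastUpdate', 'Whatever', 'Column2'],
                  data=[['2017-12-30', 'xxx', 'a'],
                        ['2017-12-30', 'yyy', 'b'],
                        ['2017-12-30', 'zzz', 'c'],
                        ['2018-01-01', 'yyy', 'b'],
                        ['2018-01-05', 'zzz', 'c'],
                        ['2018-01-06', 'xxx', 'a'],
                        ['2018-01-06', 'xxx', 'a'],
                        ['2018-01-09', 'yyy', 'b']])
df['LastUpdate'] = pd.to_datetime(df['LastUpdate'])

# Every calendar day between the first and last date in the data.
all_days = pd.date_range(df['LastUpdate'].min(), df['LastUpdate'].max(), freq='D')

# For each calendar day, look up the most recent date that has rows.
unique_dates = pd.DatetimeIndex(df['LastUpdate'].unique()).sort_values()
source = pd.Series(unique_dates, index=unique_dates).reindex(all_days, method='ffill')

# One lookup row per calendar day; the left merge then pulls in every row
# of the source date, however many there are.
lookup = pd.DataFrame({'Day': all_days, 'LastUpdate': source.to_numpy()})
filled = (lookup.merge(df, on='LastUpdate', how='left')
                .drop(columns='LastUpdate')
                .rename(columns={'Day': 'LastUpdate'}))

print(filled)
```

Because the merge is keyed on the source date, days with one row and days with several rows are handled the same way, and LastUpdate ends up as an ordinary column instead of the index.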
James Dellinger answered Nov 18 '22 17:11