Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fill dates on dataframe within groups with same ending

This is what I have:

df = pd.DataFrame({'item': [1,1,2,2,1,1],
                   'shop': ['A','A','A','A','B','B'],
                   'date': pd.to_datetime(['2018.01.'+ str(x) for x in [2,3,1,4,4,5]]),
                   'qty': [5,6,7,8,9,10]})
print(df)

   item shop       date  qty
0     1    A 2018-01-02    5
1     1    A 2018-01-03    6
2     2    A 2018-01-01    7
3     2    A 2018-01-04    8
4     1    B 2018-01-04    9
5     1    B 2018-01-05   10

This is what I want:

out = pd.DataFrame({'item': [1,1,1,1,2,2,2,2,2,1,1],
                   'shop': ['A','A','A','A','A','A','A','A','A','B','B'],
                   'date': pd.to_datetime(['2018.01.'+ str(x) for x in [2,3,4,5,1,2,3,4,5,4,5]]),
                   'qty': [5,6,0,0,7,0,0,8,0,9,10]})
print(out)

    item shop       date  qty
0      1    A 2018-01-02    5
1      1    A 2018-01-03    6
2      1    A 2018-01-04    0
3      1    A 2018-01-05    0
4      2    A 2018-01-01    7
5      2    A 2018-01-02    0
6      2    A 2018-01-03    0
7      2    A 2018-01-04    8
8      2    A 2018-01-05    0
9      1    B 2018-01-04    9
10     1    B 2018-01-05   10

This is what I achieved so far:

df.set_index('date').groupby(['item', 'shop']).resample("D")['qty'].sum().reset_index(name='qty')

   item shop       date  qty
0     1    A 2018-01-02    5
1     1    A 2018-01-03    6
2     1    B 2018-01-04    9
3     1    B 2018-01-05   10
4     2    A 2018-01-01    7
5     2    A 2018-01-02    0
6     2    A 2018-01-03    0
7     2    A 2018-01-04    8

I want to complete the missing dates (by day!) so that each group [item-shop] will end with the same date.

Ideas?

like image 676
Edo Avatar asked Dec 23 '22 15:12

Edo


2 Answers

The key here is create the min and max within different group , then we create the range and explode merge back

# find the min date for each shop under each item
s = df.groupby(['item','shop'])[['date']].min()
# find the global max
s['datemax'] = df['date'].max()
# combine two results 
s['date'] = [pd.date_range(x,y) for x , y in zip(s['date'],s['datemax'])]
out = s.explode('date').reset_index().merge(df,how='left').fillna(0)
out

    item shop       date    datemax   qty
0      1    A 2018-01-02 2018-01-05   5.0
1      1    A 2018-01-03 2018-01-05   6.0
2      1    A 2018-01-04 2018-01-05   0.0
3      1    A 2018-01-05 2018-01-05   0.0
4      1    B 2018-01-04 2018-01-05   9.0
5      1    B 2018-01-05 2018-01-05  10.0
6      2    A 2018-01-01 2018-01-05   7.0
7      2    A 2018-01-02 2018-01-05   0.0
8      2    A 2018-01-03 2018-01-05   0.0
9      2    A 2018-01-04 2018-01-05   8.0
10     2    A 2018-01-05 2018-01-05   0.0
like image 168
BENY Avatar answered Dec 28 '22 06:12

BENY


I think this gives you what you want (columns are ordered differently)

max_date = df.date.max()

def reindex_to_max_date(df):
    return df.set_index('date').reindex(pd.date_range(df.date.min(), max_date, name='date'), fill_value=0)

res = df.groupby(['shop', 'item']).apply(reindex_to_max_date)
res = res.qty.reset_index()

I grouped by shop, item to give the same sort order as you have in out but these can be swapped.

like image 26
JoeCondron Avatar answered Dec 28 '22 06:12

JoeCondron