Python de-aggregation

Question

I have a data set that is aggregated between two dates, and I want to de-aggregate it to daily values by dividing the total number by the number of days between these dates. For example:

StoreID Date_Start    Date_End     Total_Number_of_sales
78       12/04/2015    17/05/2015    79089
80       12/04/2015    17/05/2015    79089

The data set I want is:

StoreID Date         Number_Sales 
78         12/04/2015    79089/38(as there are 38 days in between)
78         13/04/2015    79089/38(as there are 38 days in between) 
78         14/04/2015    79089/38(as there are 38 days in between)
78         ...
78         17/05/2015    79089/38(as there are 38 days in between)

Any help would be useful. Thanks

Dan · Accepted Answer

I'm not sure if this is exactly what you want but you can try this (I've added another imaginary row):

import pandas as pd
df = pd.DataFrame({'date_start':['12/04/2015','17/05/2015'],
                   'date_end':['18/05/2015','10/06/2015'],
                   'sales':[79089, 1000]})

df['date_start'] = pd.to_datetime(df['date_start'], format='%d/%m/%Y')
df['date_end'] = pd.to_datetime(df['date_end'], format='%d/%m/%Y')
# date_range includes both endpoints, so count the days inclusively (+1)
df['days_diff'] = (df['date_end'] - df['date_start']).dt.days + 1


frames = []
for row in df.index:
    # one row per calendar day in this period (both endpoints included)
    new_df = pd.DataFrame(index=pd.date_range(start=df.loc[row, 'date_start'],
                                              end=df.loc[row, 'date_end'],
                                              freq='D'))
    new_df['number_sales'] = df.loc[row, 'sales'] / df.loc[row, 'days_diff']
    frames.append(new_df)
# concatenate once at the end instead of growing a dataframe inside the loop
master_df = pd.concat(frames, axis=0)

First, convert the string dates to datetime objects (so you can calculate the number of days in each range), then create a new index from the date range and divide the sales by the number of days. The loop expands each row of your dataframe into its own daily dataframe, and these are then concatenated into one master dataframe.
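
One detail worth checking when dividing: subtracting two dates gives the number of days between them, while pd.date_range includes both endpoints and so produces one more row. A minimal sketch using the dates from the question (the variable names here are only illustrative):

```python
import pandas as pd

start = pd.to_datetime('12/04/2015', format='%d/%m/%Y')
end = pd.to_datetime('17/05/2015', format='%d/%m/%Y')

# Plain subtraction does not count the start day itself.
days_between = (end - start).days
# date_range includes both the start and the end date.
days_inclusive = len(pd.date_range(start, end, freq='D'))

# Dividing by the inclusive count makes the daily values add back up to the total.
daily = 79089 / days_inclusive
```

Whether the inclusive or the exclusive count is the right divisor depends on how the aggregated periods were defined, so this is a check to run against your own data rather than a rule.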

chuni0r · Answer

What about creating a new dataframe?

start = pd.to_datetime(df['Date_Start'].values[0], dayfirst=True)
end = pd.to_datetime(df['Date_End'].values[0], dayfirst=True)
idx = pd.date_range(start=start, end=end, freq='D')
res = pd.DataFrame(df['Total_Number_of_sales'].values[0]/len(idx), index=idx, columns=['Number_Sales'])

yields

In[42]: res.head(5)
Out[42]: 
            Number_Sales
2015-04-12   2196.916667
2015-04-13   2196.916667
2015-04-14   2196.916667
2015-04-15   2196.916667
2015-04-16   2196.916667

If you have multiple stores (according to your comment and edit), then you could loop over all rows, calculate sales and concatenate the resulting dataframes afterwards.

df = pd.DataFrame({'Store_ID': [78, 78, 80],
                   'Date_Start': ['12/04/2015', '18/05/2015', '21/06/2015'],
                   'Date_End': ['17/05/2015', '10/06/2015', '01/07/2015'],
                   'Total_Number_of_sales': [79089., 50000., 25000.]})

to_concat = []
for _, row in df.iterrows():
    start = pd.to_datetime(row['Date_Start'], dayfirst=True)
    end = pd.to_datetime(row['Date_End'], dayfirst=True)
    # daily index for this row; both endpoints are included
    idx = pd.date_range(start=start, end=end, freq='D')
    sales = [row['Total_Number_of_sales']/len(idx)] * len(idx)
    store_ids = [row['Store_ID']] * len(idx)
    res = pd.DataFrame({'Store_ID': store_ids, 'Number_Sales': sales}, index=idx)
    to_concat.append(res)

res = pd.concat(to_concat)

There are definitely more elegant solutions; have a look, for example, at this thread.
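
One possible loop-free variant (not necessarily what the linked thread describes) is to store the list of days for each row and let DataFrame.explode, available from pandas 0.25, repeat the rows. This is only a sketch that reuses the example data above; the names n_days and daily are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Store_ID': [78, 78, 80],
                   'Date_Start': ['12/04/2015', '18/05/2015', '21/06/2015'],
                   'Date_End': ['17/05/2015', '10/06/2015', '01/07/2015'],
                   'Total_Number_of_sales': [79089., 50000., 25000.]})

start = pd.to_datetime(df['Date_Start'], format='%d/%m/%Y')
end = pd.to_datetime(df['Date_End'], format='%d/%m/%Y')

# One list of daily timestamps per row; both endpoints are included.
df['Date'] = [pd.date_range(s, e, freq='D').tolist() for s, e in zip(start, end)]
# Inclusive number of days in each period.
n_days = (end - start).dt.days + 1
df['Number_Sales'] = df['Total_Number_of_sales'] / n_days

# explode() repeats each row once per element of its 'Date' list.
daily = (df.explode('Date')[['Store_ID', 'Date', 'Number_Sales']]
           .reset_index(drop=True))
# The exploded column has object dtype, so convert it back to datetimes.
daily['Date'] = pd.to_datetime(daily['Date'])
```

A quick sanity check is daily.groupby('Store_ID')['Number_Sales'].sum(), which should reproduce the original totals per store.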

Tags:

python

pandas

emkay
