
Python de-aggregation

Tags: python, pandas

I have a data set that is aggregated between two dates, and I want to de-aggregate it to daily values by dividing the total by the number of days between those dates. A sample:

StoreID Date_Start    Date_End     Total_Number_of_sales
78       12/04/2015    17/05/2015    79089
80       12/04/2015    17/05/2015    79089

The data set I want is:

StoreID  Date          Number_Sales
78       12/04/2015    79089/36 (36 days, counting both the start and end dates)
78       13/04/2015    79089/36
78       14/04/2015    79089/36
78       ...
78       17/05/2015    79089/36

Any help would be useful. Thanks

asked Oct 16 '22 14:10 by emkay


2 Answers

I'm not sure if this is exactly what you want but you can try this (I've added another imaginary row):

import pandas as pd

df = pd.DataFrame({'date_start': ['12/04/2015', '17/05/2015'],
                   'date_end': ['18/05/2015', '10/06/2015'],
                   'sales': [79089, 1000]})

# Parse the day-first strings into datetimes
df['date_start'] = pd.to_datetime(df['date_start'], format='%d/%m/%Y')
df['date_end'] = pd.to_datetime(df['date_end'], format='%d/%m/%Y')
# date_range below includes both endpoints, so count both of them here (+1);
# otherwise the daily values would not add back up to the total
df['n_days'] = (df['date_end'] - df['date_start']).dt.days + 1

# Build one daily frame per input row, then concatenate once at the end
frames = []
for row in range(len(df)):
    new_df = pd.DataFrame(index=pd.date_range(start=df['date_start'].iloc[row],
                                              end=df['date_end'].iloc[row],
                                              freq='D'))
    new_df['number_sales'] = df['sales'].iloc[row] / df['n_days'].iloc[row]
    frames.append(new_df)

master_df = pd.concat(frames, axis=0)

First, convert the date strings to datetime objects so that you can calculate the number of days in each range. Then create a new index from the date range and divide the sales by the number of days. The loop expands each row of your dataframe into its own daily dataframe and then concatenates the results into one master dataframe.
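One detail worth checking in this approach is which day count to divide by. A minimal sketch, using the dates from the question's sample row: subtracting the dates gives the gap between them, while an inclusive date range contains one more day than that. The variable names are illustrative only.

```python
import pandas as pd

start = pd.to_datetime('12/04/2015', format='%d/%m/%Y')
end = pd.to_datetime('17/05/2015', format='%d/%m/%Y')

# Difference between the two dates
gap_days = (end - start).days
# Number of days in the range when both endpoints are included
n_days = len(pd.date_range(start, end, freq='D'))

daily = 79089 / n_days
# Dividing by the inclusive count makes the daily values add back up to the total
reconstructed = daily * n_days
```

If the daily rows are generated with an inclusive range, dividing by `n_days` rather than `gap_days` keeps the sum of the daily values equal to the original total.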

answered Oct 21 '22 01:10 by Dan


What about creating a new dataframe?

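# Parse the day-first date strings from the first row
start = pd.to_datetime(df['Date_Start'].values[0], dayfirst=True)
end = pd.to_datetime(df['Date_End'].values[0], dayfirst=True)
# pd.date_range replaces DatetimeIndex(start=..., end=...), which was removed in pandas 1.0
idx = pd.date_range(start=start, end=end, freq='D')
res = pd.DataFrame(df['Total_Number_of_sales'].values[0] / len(idx), index=idx, columns=['Number_Sales'])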

yields

In[42]: res.head(5)
Out[42]: 
            Number_Sales
2015-04-12   2196.916667
2015-04-13   2196.916667
2015-04-14   2196.916667
2015-04-15   2196.916667
2015-04-16   2196.916667

If you have multiple stores (according to your comment and edit), then you could loop over all rows, calculate sales and concatenate the resulting dataframes afterwards.

import pandas as pd

df = pd.DataFrame({'Store_ID': [78, 78, 80],
                   'Date_Start': ['12/04/2015', '18/05/2015', '21/06/2015'],
                   'Date_End': ['17/05/2015', '10/06/2015', '01/07/2015'],
                   'Total_Number_of_sales': [79089., 50000., 25000.]})

to_concat = []
for _, row in df.iterrows():
    start = pd.to_datetime(row['Date_Start'], dayfirst=True)
    end = pd.to_datetime(row['Date_End'], dayfirst=True)
    # pd.date_range includes both endpoints (DatetimeIndex(start=...) was removed in pandas 1.0)
    idx = pd.date_range(start=start, end=end, freq='D')
    sales = [row['Total_Number_of_sales'] / len(idx)] * len(idx)
    # Named store_ids to avoid shadowing the built-in id()
    store_ids = [row['Store_ID']] * len(idx)
    res = pd.DataFrame({'Store_ID': store_ids, 'Number_Sales': sales}, index=idx)
    to_concat.append(res)

# Stack the per-row daily frames into one result
res = pd.concat(to_concat)

There are definitely more elegant solutions; have a look, for example, at this thread.
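One possible more compact variant, not necessarily the approach from the linked thread: build the date range for each row, then let DataFrame.explode (available from pandas 0.25) expand the lists into rows. This is only a sketch reusing the example frame from this answer; the `daily` name and the intermediate `Date` column are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'Store_ID': [78, 78, 80],
                   'Date_Start': ['12/04/2015', '18/05/2015', '21/06/2015'],
                   'Date_End': ['17/05/2015', '10/06/2015', '01/07/2015'],
                   'Total_Number_of_sales': [79089., 50000., 25000.]})

start = pd.to_datetime(df['Date_Start'], format='%d/%m/%Y')
end = pd.to_datetime(df['Date_End'], format='%d/%m/%Y')

# One list of daily timestamps per row, both endpoints included
df['Date'] = pd.Series([list(pd.date_range(s, e, freq='D'))
                        for s, e in zip(start, end)], index=df.index)
# Spread each total evenly over the days of its own range
df['Number_Sales'] = df['Total_Number_of_sales'] / df['Date'].map(len)

# explode() gives each list element its own row and repeats the other columns
daily = (df.explode('Date')[['Store_ID', 'Date', 'Number_Sales']]
           .reset_index(drop=True))
daily['Date'] = pd.to_datetime(daily['Date'])
```

Grouping `daily` by `Store_ID` and summing `Number_Sales` should give back the per-store totals, which is a quick way to confirm that nothing was lost in the expansion.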

answered Oct 21 '22 00:10 by chuni0r