I have a Pandas data frame which you might describe as “normalized”. For display purposes, I want to “de-normalize” the data: that is, I want to take data that is spread across multiple key values and put it on the same row in the output records. Some records need to be summed as they are combined. (Aside: if anyone has a better term for this than “denormalization”, please edit this question or say so in the comments.)
I am working with a pandas data frame with many columns, so I will show a simplified version below.
The following code sets up a (nearly) normalized source data frame. (Note that I am looking for advice on the second code block; this one is just to provide context.) As in my actual data, there are some duplicates in the identifying columns and some numbers to be summed:
import pandas as pd
import numpy as np  # needed for np.random.randint below

# Overlapping date ranges, so some (Date, Type) pairs appear in two batches
dates = pd.date_range('20170701', periods=21)
datesA1 = pd.date_range('20170701', periods=11)
datesB1 = pd.date_range('20170705', periods=9)
datesA2 = pd.date_range('20170708', periods=10)
datesB2 = pd.date_range('20170710', periods=11)
datesC1 = pd.date_range('20170701', periods=5)
datesC2 = pd.date_range('20170709', periods=9)
cols = ['Date', 'Type', 'Count']  # (not used below)

# Two batches per fruit type, with random counts
df_A1 = pd.DataFrame({'Date': datesA1,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=11)})
df_A2 = pd.DataFrame({'Date': datesA2,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=10)})
df_B1 = pd.DataFrame({'Date': datesB1,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=9)})
df_B2 = pd.DataFrame({'Date': datesB2,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=11)})
df_C1 = pd.DataFrame({'Date': datesC1,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=5)})
df_C2 = pd.DataFrame({'Date': datesC2,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=9)})

frames = [df_A1, df_A2, df_B1, df_B2, df_C1, df_C2]
dat_fra_source = pd.concat(frames)
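One note on the setup: the counts are random, so the numbers in the output tables below vary from run to run (and this is presumably why the tables in the answers don’t all agree with each other). If you want reproducible output, you could fix NumPy’s seed before building the frames, e.g.:

np.random.seed(0)  # any fixed seed; 0 is arbitrary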
Further, the following code achieves my intention. The source data frame has multiple rows per date and per type of fruit (A, B, and C). The destination data frame has a single row per day, with the counts of A, B, and C summed into their own columns.
dat_fra_dest = pd.DataFrame(0, index=dates, columns=['Apples', 'Berries', 'Canteloupes'])
for index, row in dat_fra_source.iterrows():
    dat_fra_dest.at[row['Date'], row['Type']] += row['Count']
My question is whether there is a cleaner way to do this: a way that doesn’t require the zero-initialization and/or operates on the entire data frame instead of row by row. I am also skeptical that my implementation is efficient. I’ll also note that while I am only dealing with a “count” in the simplified example, my real-world data has additional columns: think of A, B, and C each having not only a count, but also a weight and a volume.
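For concreteness, a sketch of what one batch of the extended source might look like (the Weight and Volume column names here are illustrative, not my real ones):

df_A1 = pd.DataFrame({'Date': datesA1,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=11),
                      'Weight': np.random.rand(11),   # illustrative extra measure
                      'Volume': np.random.rand(11)})  # illustrative extra measure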
Option 1
dat_fra_source.groupby(['Date','Type']).sum().unstack().fillna(0)
Out[63]:
Count
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
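A note on this result: unstack() leaves Count as the outer level of a column MultiIndex (that is the extra Count header above). If you want only the inner level, or rows for the full 21-day range including days with no source rows at all (2017-07-21 here), a short follow-up works; a sketch, using the dates range from the question:

out = dat_fra_source.groupby(['Date', 'Type']).sum().unstack().fillna(0)
out = out['Count']                      # drop the outer 'Count' level
out = out.reindex(dates, fill_value=0)  # add rows for days with no data

This also covers the extra columns from the question: sum() aggregates every numeric column, so with weight and volume present you would get a block of ('Count', ...), ('weight', ...), ('volume', ...) columns to select from.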
Option 2
pd.pivot_table(dat_fra_source, index=['Date'], columns=['Type'],
               values='Count', aggfunc='sum').fillna(0)
Out[75]:
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
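A small refinement: pivot_table also accepts fill_value, which avoids the trailing fillna (a sketch):

pd.pivot_table(dat_fra_source, index=['Date'], columns=['Type'],
               values='Count', aggfunc='sum', fill_value=0)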
And assuming you have columns vol and weight:

dat_fra_source['vol'] = 2
dat_fra_source['weight'] = 2
dat_fra_source.groupby(['Date', 'Type']).apply(
    lambda x: sum(x['vol'] * x['weight'] * x['Count'])).unstack().fillna(0)
Out[88]:
Type Apples Berries Canteloupes
Date
2017-07-01 52.0 0.0 96.0
2017-07-02 72.0 0.0 64.0
2017-07-03 44.0 0.0 116.0
2017-07-04 52.0 0.0 28.0
2017-07-05 96.0 44.0 92.0
2017-07-06 24.0 16.0 0.0
2017-07-07 116.0 104.0 0.0
2017-07-08 124.0 76.0 0.0
2017-07-09 152.0 68.0 104.0
2017-07-10 228.0 216.0 4.0
2017-07-11 16.0 164.0 40.0
2017-07-12 64.0 112.0 92.0
2017-07-13 100.0 80.0 80.0
2017-07-14 76.0 24.0 60.0
2017-07-15 24.0 88.0 28.0
2017-07-16 64.0 0.0 20.0
2017-07-17 116.0 28.0 16.0
2017-07-18 0.0 84.0 0.0
2017-07-19 0.0 76.0 0.0
2017-07-20 0.0 32.0 0.0
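If you want each measure in its own column rather than the combined product, values accepts a list; and the product can be computed vectorized up front, which avoids the Python-level apply. A sketch, using the vol and weight columns added above (vw_count is a hypothetical helper column name):

# one column block per measure
pd.pivot_table(dat_fra_source, index=['Date'], columns=['Type'],
               values=['Count', 'vol', 'weight'], aggfunc='sum', fill_value=0)

# vectorized equivalent of the apply above
dat_fra_source['vw_count'] = (dat_fra_source['vol']
                              * dat_fra_source['weight']
                              * dat_fra_source['Count'])
pd.pivot_table(dat_fra_source, index=['Date'], columns=['Type'],
               values='vw_count', aggfunc='sum', fill_value=0)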
Use pd.crosstab:

pd.crosstab(dat_fra_source['Date'],
            dat_fra_source['Type'],
            dat_fra_source['Count'],
            aggfunc='sum',
            dropna=False).fillna(0)
Output:
Type Apples Berries Canteloupes
Date
2017-07-01 19.0 0.0 4.0
2017-07-02 25.0 0.0 4.0
2017-07-03 11.0 0.0 26.0
2017-07-04 27.0 0.0 8.0
2017-07-05 8.0 18.0 12.0
2017-07-06 10.0 11.0 0.0
2017-07-07 6.0 17.0 0.0
2017-07-08 10.0 5.0 0.0
2017-07-09 51.0 25.0 16.0
2017-07-10 31.0 23.0 21.0
2017-07-11 35.0 40.0 10.0
2017-07-12 16.0 30.0 9.0
2017-07-13 13.0 23.0 20.0
2017-07-14 21.0 26.0 27.0
2017-07-15 20.0 17.0 19.0
2017-07-16 12.0 4.0 2.0
2017-07-17 27.0 0.0 5.0
2017-07-18 0.0 5.0 0.0
2017-07-19 0.0 26.0 0.0
2017-07-20 0.0 6.0 0.0
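One caveat: dropna=False keeps columns even for types whose entries are all NaN, but dates missing from the source entirely (2017-07-21 here) still get no row; as with the other options, a reindex against the full range fills that in (a sketch):

ct = pd.crosstab(dat_fra_source['Date'],
                 dat_fra_source['Type'],
                 dat_fra_source['Count'],
                 aggfunc='sum').fillna(0)
ct = ct.reindex(dates, fill_value=0)  # add the empty 2017-07-21 row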