I have a Pandas data frame which you might describe as “normalized”. For display purposes, I want to “de-normalize” the data: that is, I want to take data that is spread across multiple key values and put it on the same row in the output records. Some records need to be summed as they are combined. (Aside: if anyone has a better term for this than “denormalization”, please edit this question or say so in the comments.)
I am working with a pandas data frame with many columns, so I will show a simplified version below.
The following code sets up a (nearly) normalized source data frame. (Note that I am looking for advice on the second code block; this one is just to provide context.) As in my actual data, there are some duplicates in the identifying columns and some numbers to be summed:
import pandas as pd
import numpy as np  # needed for np.random.randint below

# Overlapping date ranges, so some (Date, Type) pairs appear in two batches
dates = pd.date_range('20170701', periods=21)
datesA1 = pd.date_range('20170701', periods=11)
datesB1 = pd.date_range('20170705', periods=9)
datesA2 = pd.date_range('20170708', periods=10)
datesB2 = pd.date_range('20170710', periods=11)
datesC1 = pd.date_range('20170701', periods=5)
datesC2 = pd.date_range('20170709', periods=9)
cols = ['Date', 'Type', 'Count']  # (not used below)

# Two batches per fruit type, with random counts
df_A1 = pd.DataFrame({'Date': datesA1,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=11)})
df_A2 = pd.DataFrame({'Date': datesA2,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=10)})
df_B1 = pd.DataFrame({'Date': datesB1,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=9)})
df_B2 = pd.DataFrame({'Date': datesB2,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=11)})
df_C1 = pd.DataFrame({'Date': datesC1,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=5)})
df_C2 = pd.DataFrame({'Date': datesC2,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=9)})

frames = [df_A1, df_A2, df_B1, df_B2, df_C1, df_C2]
dat_fra_source = pd.concat(frames)
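One note on the setup: the counts are random, so the numbers in the output tables below vary from run to run (and this is presumably why the tables in the answers don’t all agree with each other). If you want reproducible output, you could fix NumPy’s seed before building the frames, e.g.:

np.random.seed(0)  # any fixed seed; 0 is arbitrary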
Further, the following code achieves my intention. The source data frame has multiple rows per date and per type of fruit (A, B, and C). The destination data frame has a single row per day, with the counts of A, B, and C summed into their own columns.
dat_fra_dest = pd.DataFrame(0, index=dates, columns=['Apples', 'Berries', 'Canteloupes'])
for index, row in dat_fra_source.iterrows():
    dat_fra_dest.at[row['Date'], row['Type']] += row['Count']
My question is whether there is a cleaner way to do this: a way that doesn’t require the zero-initialization and/or operates on the entire data frame instead of row by row. I am also skeptical that my implementation is efficient. I’ll also note that while I am only dealing with a “count” in the simplified example, my real-world data has additional columns: think of A, B, and C each having not only a count, but also a weight and a volume.
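For concreteness, a sketch of what one batch of the extended source might look like (the Weight and Volume column names here are illustrative, not my real ones):

df_A1 = pd.DataFrame({'Date': datesA1,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=11),
                      'Weight': np.random.rand(11),   # illustrative extra measure
                      'Volume': np.random.rand(11)})  # illustrative extra measure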
Option 1
dat_fra_source.groupby(['Date','Type']).sum().unstack().fillna(0)
Out[63]:
Count
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
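A note on this result: unstack() leaves Count as the outer level of a column MultiIndex (that is the extra Count header above). If you want only the inner level, or rows for the full 21-day range including days with no source rows at all (2017-07-21 here), a short follow-up works; a sketch, using the dates range from the question:

out = dat_fra_source.groupby(['Date', 'Type']).sum().unstack().fillna(0)
out = out['Count']                      # drop the outer 'Count' level
out = out.reindex(dates, fill_value=0)  # add rows for days with no data

This also covers the extra columns from the question: sum() aggregates every numeric column, so with weight and volume present you would get a block of ('Count', ...), ('weight', ...), ('volume', ...) columns to select from.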
Option 2
pd.pivot_table(dat_fra_source, index=['Date'], columns=['Type'],
               values='Count', aggfunc='sum').fillna(0)
Out[75]:
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
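A small refinement: pivot_table also accepts fill_value, which avoids the trailing fillna (a sketch):

pd.pivot_table(dat_fra_source, index=['Date'], columns=['Type'],
               values='Count', aggfunc='sum', fill_value=0)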
And assuming you have columns vol and weight:

dat_fra_source['vol'] = 2
dat_fra_source['weight'] = 2
dat_fra_source.groupby(['Date', 'Type']).apply(
    lambda x: sum(x['vol'] * x['weight'] * x['Count'])).unstack().fillna(0)
Out[88]:
Type Apples Berries Canteloupes
Date
2017-07-01 52.0 0.0 96.0
2017-07-02 72.0 0.0 64.0
2017-07-03 44.0 0.0 116.0
2017-07-04 52.0 0.0 28.0
2017-07-05 96.0 44.0 92.0
2017-07-06 24.0 16.0 0.0
2017-07-07 116.0 104.0 0.0
2017-07-08 124.0 76.0 0.0
2017-07-09 152.0 68.0 104.0
2017-07-10 228.0 216.0 4.0
2017-07-11 16.0 164.0 40.0
2017-07-12 64.0 112.0 92.0
2017-07-13 100.0 80.0 80.0
2017-07-14 76.0 24.0 60.0
2017-07-15 24.0 88.0 28.0
2017-07-16 64.0 0.0 20.0
2017-07-17 116.0 28.0 16.0
2017-07-18 0.0 84.0 0.0
2017-07-19 0.0 76.0 0.0
2017-07-20 0.0 32.0 0.0
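If you want each measure in its own column rather than the combined product, values accepts a list; and the product can be computed vectorized up front, which avoids the Python-level apply. A sketch, using the vol and weight columns added above (vw_count is a hypothetical helper column name):

# one column block per measure
pd.pivot_table(dat_fra_source, index=['Date'], columns=['Type'],
               values=['Count', 'vol', 'weight'], aggfunc='sum', fill_value=0)

# vectorized equivalent of the apply above
dat_fra_source['vw_count'] = (dat_fra_source['vol']
                              * dat_fra_source['weight']
                              * dat_fra_source['Count'])
pd.pivot_table(dat_fra_source, index=['Date'], columns=['Type'],
               values='vw_count', aggfunc='sum', fill_value=0)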
Use pd.crosstab:

pd.crosstab(dat_fra_source['Date'],
            dat_fra_source['Type'],
            dat_fra_source['Count'],
            aggfunc='sum',
            dropna=False).fillna(0)
Output:
Type Apples Berries Canteloupes
Date
2017-07-01 19.0 0.0 4.0
2017-07-02 25.0 0.0 4.0
2017-07-03 11.0 0.0 26.0
2017-07-04 27.0 0.0 8.0
2017-07-05 8.0 18.0 12.0
2017-07-06 10.0 11.0 0.0
2017-07-07 6.0 17.0 0.0
2017-07-08 10.0 5.0 0.0
2017-07-09 51.0 25.0 16.0
2017-07-10 31.0 23.0 21.0
2017-07-11 35.0 40.0 10.0
2017-07-12 16.0 30.0 9.0
2017-07-13 13.0 23.0 20.0
2017-07-14 21.0 26.0 27.0
2017-07-15 20.0 17.0 19.0
2017-07-16 12.0 4.0 2.0
2017-07-17 27.0 0.0 5.0
2017-07-18 0.0 5.0 0.0
2017-07-19 0.0 26.0 0.0
2017-07-20 0.0 6.0 0.0
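One caveat: dropna=False keeps columns even for types whose entries are all NaN, but dates missing from the source entirely (2017-07-21 here) still get no row; as with the other options, a reindex against the full range fills that in (a sketch):

ct = pd.crosstab(dat_fra_source['Date'],
                 dat_fra_source['Type'],
                 dat_fra_source['Count'],
                 aggfunc='sum').fillna(0)
ct = ct.reindex(dates, fill_value=0)  # add the empty 2017-07-21 row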