
Python Pandas: Denormalize data from one data frame into another

I have a Pandas data frame which you might describe as “normalized”. For display purposes, I want to “de-normalize” the data: that is, I want to take data that is spread across rows with different key values and combine it onto a single row in the output. Some values need to be summed as the rows are combined. (Aside: if anyone has a better term for this than “denormalization”, please edit this question or say so in the comments.)

I am working with a pandas data frame with many columns, so I will show you a simplified version below.

The following code sets up a (nearly) normalized source data frame. (Note that I am looking for advice on the second code block, and this code block is just to provide some context.) Similar to my actual data, there are some duplications in the identifying data, and some numbers to be summed:

import numpy as np
import pandas as pd

dates = pd.date_range('20170701', periods=21)
datesA1 = pd.date_range('20170701', periods=11)
datesB1 = pd.date_range('20170705', periods=9)
datesA2 = pd.date_range('20170708', periods=10)
datesB2 = pd.date_range('20170710', periods=11)
datesC1 = pd.date_range('20170701', periods=5)
datesC2 = pd.date_range('20170709', periods=9)

cols=['Date','Type','Count']

df_A1 = pd.DataFrame({'Date':datesA1,
                      'Type':'Apples',
                      'Count': np.random.randint(30,size=11)})
df_A2 = pd.DataFrame({'Date':datesA2,
                      'Type':'Apples',
                      'Count': np.random.randint(30,size=10)})
df_B1 = pd.DataFrame({'Date':datesB1,
                      'Type':'Berries',
                      'Count': np.random.randint(30,size=9)})
df_B2 = pd.DataFrame({'Date':datesB2,
                      'Type':'Berries',
                      'Count': np.random.randint(30,size=11)})
df_C1 = pd.DataFrame({'Date':datesC1,
                      'Type':'Canteloupes',
                      'Count': np.random.randint(30,size=5)})
df_C2 = pd.DataFrame({'Date':datesC2,
                      'Type':'Canteloupes',
                      'Count': np.random.randint(30,size=9)})

frames = [df_A1, df_A2, df_B1, df_B2, df_C1, df_C2]

dat_fra_source = pd.concat(frames)

Further, the following code achieves my intention. The source data frame has multiple rows per date and type of fruit (A, B, and C). The destination data has a single row per day, with a sum of A, B, and C.

dat_fra_dest = pd.DataFrame(0, index=dates, columns=['Apples','Berries','Canteloupes'])

for index,row in dat_fra_source.iterrows():
    dat_fra_dest.at[row['Date'],row['Type']]+=row['Count']

My question is whether there is a cleaner way to do this: one that doesn’t require the zero-initialization, and/or one that operates on the entire data frame instead of row by row. I am also skeptical that my implementation is efficient. I’ll also note that while I am only dealing with “Count” in the simplified example, I have additional columns in my real-world data. Imagine that for A, B, and C there is not only a count, but also a weight and a volume.

asked Oct 01 '17 by M. K. Hunter
2 Answers

Option 1

dat_fra_source.groupby(['Date','Type']).sum().unstack().fillna(0)

Out[63]: 
            Count                    
Type       Apples Berries Canteloupes
Date                                 
2017-07-01   13.0     0.0        24.0
2017-07-02   18.0     0.0        16.0
2017-07-03   11.0     0.0        29.0
2017-07-04   13.0     0.0         7.0
2017-07-05   24.0    11.0        23.0
2017-07-06    6.0     4.0         0.0
2017-07-07   29.0    26.0         0.0
2017-07-08   31.0    19.0         0.0
2017-07-09   38.0    17.0        26.0
2017-07-10   57.0    54.0         1.0
2017-07-11    4.0    41.0        10.0
2017-07-12   16.0    28.0        23.0
2017-07-13   25.0    20.0        20.0
2017-07-14   19.0     6.0        15.0
2017-07-15    6.0    22.0         7.0
2017-07-16   16.0     0.0         5.0
2017-07-17   29.0     7.0         4.0
2017-07-18    0.0    21.0         0.0
2017-07-19    0.0    19.0         0.0
2017-07-20    0.0     8.0         0.0
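One wrinkle with Option 1: the result keeps a two-level column index, with an extra Count level above the fruit names (visible in the output above). Selecting the column before unstacking avoids that. A minimal sketch on a toy frame (the values here are illustrative, not the question's random data):

```python
import pandas as pd

# Small frame in the same shape as dat_fra_source
df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-07-01', '2017-07-01', '2017-07-02']),
    'Type': ['Apples', 'Apples', 'Berries'],
    'Count': [3, 4, 5],
})

# Selecting 'Count' before unstacking yields a single column level,
# and fill_value=0 replaces the NaNs that fillna(0) handled above.
out = df.groupby(['Date', 'Type'])['Count'].sum().unstack(fill_value=0)
print(out)
```

The `fill_value=0` argument to `unstack` also removes the need for the trailing `.fillna(0)`.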

Option 2

pd.pivot_table(dat_fra_source,index=['Date'],columns=['Type'],values='Count',aggfunc=sum).fillna(0)
Out[75]: 
Type        Apples  Berries  Canteloupes
Date                                    
2017-07-01    13.0      0.0         24.0
2017-07-02    18.0      0.0         16.0
2017-07-03    11.0      0.0         29.0
2017-07-04    13.0      0.0          7.0
2017-07-05    24.0     11.0         23.0
2017-07-06     6.0      4.0          0.0
2017-07-07    29.0     26.0          0.0
2017-07-08    31.0     19.0          0.0
2017-07-09    38.0     17.0         26.0
2017-07-10    57.0     54.0          1.0
2017-07-11     4.0     41.0         10.0
2017-07-12    16.0     28.0         23.0
2017-07-13    25.0     20.0         20.0
2017-07-14    19.0      6.0         15.0
2017-07-15     6.0     22.0          7.0
2017-07-16    16.0      0.0          5.0
2017-07-17    29.0      7.0          4.0
2017-07-18     0.0     21.0          0.0
2017-07-19     0.0     19.0          0.0
2017-07-20     0.0      8.0          0.0

And assuming you also have vol and weight columns, you can combine them inside the aggregation:

dat_fra_source['vol']=2
dat_fra_source['weight']=2
dat_fra_source.groupby(['Date','Type']).apply(lambda x: sum(x['vol']*x['weight']*x['Count'])).unstack().fillna(0)
Out[88]: 
Type        Apples  Berries  Canteloupes
Date                                    
2017-07-01    52.0      0.0         96.0
2017-07-02    72.0      0.0         64.0
2017-07-03    44.0      0.0        116.0
2017-07-04    52.0      0.0         28.0
2017-07-05    96.0     44.0         92.0
2017-07-06    24.0     16.0          0.0
2017-07-07   116.0    104.0          0.0
2017-07-08   124.0     76.0          0.0
2017-07-09   152.0     68.0        104.0
2017-07-10   228.0    216.0          4.0
2017-07-11    16.0    164.0         40.0
2017-07-12    64.0    112.0         92.0
2017-07-13   100.0     80.0         80.0
2017-07-14    76.0     24.0         60.0
2017-07-15    24.0     88.0         28.0
2017-07-16    64.0      0.0         20.0
2017-07-17   116.0     28.0         16.0
2017-07-18     0.0     84.0          0.0
2017-07-19     0.0     76.0          0.0
2017-07-20     0.0     32.0          0.0
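The snippet above multiplies vol × weight × Count into a single measure. If instead each measure should be summed independently per date and fruit (which the question's mention of separate count, weight, and volume columns may suggest), the same groupby/unstack pattern extends to multiple value columns. A sketch on a toy frame; the Weight column name is an assumption, not from the question:

```python
import pandas as pd

# Toy stand-in for dat_fra_source with one extra numeric column
df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-07-01', '2017-07-01', '2017-07-02']),
    'Type': ['Apples', 'Berries', 'Apples'],
    'Count': [3, 5, 2],
    'Weight': [1.5, 2.0, 0.5],
})

# Sum every selected column at once; unstacking 'Type' makes each
# (measure, fruit) pair its own column.
out = (df.groupby(['Date', 'Type'])[['Count', 'Weight']]
         .sum()
         .unstack(fill_value=0))
print(out)
```

The result has a two-level column index such as `('Count', 'Apples')` and `('Weight', 'Berries')`, one sub-column per fruit for each measure.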
answered Oct 21 '22 by BENY

Use pd.crosstab:

pd.crosstab(dat_fra_source['Date'],
            dat_fra_source['Type'],
            dat_fra_source['Count'],
            aggfunc='sum',
            dropna=False).fillna(0)

Output:

Type        Apples  Berries  Canteloupes
Date                                    
2017-07-01    19.0      0.0          4.0
2017-07-02    25.0      0.0          4.0
2017-07-03    11.0      0.0         26.0
2017-07-04    27.0      0.0          8.0
2017-07-05     8.0     18.0         12.0
2017-07-06    10.0     11.0          0.0
2017-07-07     6.0     17.0          0.0
2017-07-08    10.0      5.0          0.0
2017-07-09    51.0     25.0         16.0
2017-07-10    31.0     23.0         21.0
2017-07-11    35.0     40.0         10.0
2017-07-12    16.0     30.0          9.0
2017-07-13    13.0     23.0         20.0
2017-07-14    21.0     26.0         27.0
2017-07-15    20.0     17.0         19.0
2017-07-16    12.0      4.0          2.0
2017-07-17    27.0      0.0          5.0
2017-07-18     0.0      5.0          0.0
2017-07-19     0.0     26.0          0.0
2017-07-20     0.0      6.0          0.0
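One difference from the question's zero-initialized destination frame: crosstab, like the groupby and pivot_table answers, only emits rows for dates that actually appear in the data, so the final day of the 21-day `dates` range (2017-07-21) is missing from the outputs above. A `reindex` restores the full calendar; a minimal sketch on a toy frame:

```python
import pandas as pd

dates = pd.date_range('20170701', periods=21)

# Toy source whose data does not cover the final day of `dates`
df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-07-01', '2017-07-03']),
    'Type': ['Apples', 'Berries'],
    'Count': [3, 5],
})

out = (df.groupby(['Date', 'Type'])['Count'].sum()
         .unstack(fill_value=0)
         .reindex(dates, fill_value=0))  # force one row per calendar day
print(out.shape)
```

The same `.reindex(dates, fill_value=0)` can be chained onto any of the three answers.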
answered Oct 21 '22 by Scott Boston