Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas - GroupBy and then Merge on original table

I'm trying to write a function to aggregate and perform various stats calcuations on a dataframe in Pandas and then merge it to the original dataframe however, I'm running to issues. This is code equivalent in SQL:

SELECT EID,        PCODE,        SUM(PVALUE) AS PVALUE,        SUM(SQRT(SC*EXP(SC-1))) AS SC,        SUM(SI) AS SI,        SUM(EE) AS EE INTO foo_bar_grp FROM foo_bar GROUP BY EID, PCODE  

And then join on the original table:

SELECT * FROM foo_bar_grp INNER JOIN  foo_bar ON foo_bar.EID = foo_bar_grp.EID          AND foo_bar.PCODE = foo_bar_grp.PCODE 

Here are the steps: Loading the data IN:>>

pol_dict = {'PID':[1,1,2,2],              'EID':[123,123,123,123],              'PCODE':['GU','GR','GU','GR'],              'PVALUE':[100,50,150,300],              'SI':[400,40,140,140],              'SC':[230,23,213,213],              'EE':[10000,10000,2000,30000],              }   pol_df = DataFrame(pol_dict)  pol_df 

OUT:>>

   EID    EE PCODE  PID  PVALUE   SC   SI 0  123  10000    GU    1     100  230  400 1  123  10000    GR    1      50   23   40 2  123   2000    GU    2     150  213  140 3  123  30000    GR    2     300  213  140 

Step 2: Calculating and Grouping on the data:

My pandas code is as follows:

#create aggregation dataframe poagg_df = pol_df del poagg_df['PID'] po_grouped_df = poagg_df.groupby(['EID','PCODE'])  #generate acc level aggregate acc_df = po_grouped_df.agg({     'PVALUE' : np.sum,     'SI' : lambda x: np.sqrt(np.sum(x * np.exp(x-1))),     'SC' : np.sum,     'EE' : np.sum }) 

This works fine until I want to join on the original table:

IN:>>

po_account_df = pd.merge(acc_df, po_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po')) 

OUT:>> KeyError: u'no item named EID'

For some reason, the grouped dataframe can't join back to the original table. I've looked at ways of trying to convert the groupby columns to actual columns but that doesn't seem to work.

Please note, the end goal is to be able to find the percentage for each column (PVALUE, SI, SC, EE) IE:

pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po/pol_acc_df.PVALUE_Acc,4) 

Thanks!

like image 662
WalkingDeadFan Avatar asked Jul 27 '14 11:07

WalkingDeadFan


People also ask

How do I merge groupby in pandas?

Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.

Does groupby return a DataFrame or series?

So a groupby() operation can downcast to a Series, or if given a Series as input, can upcast to dataframe. For your first dataframe, you run unequal groupings (or unequal index lengths) coercing a series return which in the "combine" processing does not adequately yield a data frame.

What are the three phases of the pandas groupby () function?

The “group by” process: split-apply-combine (1) Splitting the data into groups. (2). Applying a function to each group independently, (3) Combining the results into a data structure.

Does pandas groupby preserve order?

Groupby preserves the order of rows within each group.


2 Answers

By default, groupby output has the grouping columns as indicies, not columns, which is why the merge is failing.

There are a couple different ways to handle it, probably the easiest is using the as_index parameter when you define the groupby object.

po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False) 

Then, your merge should work as expected.

In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po')) Out[356]:     EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \ 0  123    GR     236   40000  1.805222e+31         350  10000         50    1  123    GR     236   40000  1.805222e+31         350  30000        300    2  123    GU     443   12000  8.765549e+87         250  10000        100    3  123    GU     443   12000  8.765549e+87         250   2000        150        SC_Po  SI_Po   0     23     40   1    213    140   2    230    400   3    213    140   
like image 93
chrisb Avatar answered Sep 19 '22 08:09

chrisb


From the pandas docs:

Transformation: perform some group-specific computations and return a like-indexed object

Unfortunately, transform works series by series, so you wouldn't be able to perform multiple functions on multiple columns as you've done with agg, but transform does allow you to skip merge

po_grouped_df = pol_df.groupby(['EID','PCODE']) pol_df['sum_pval'] = po_grouped_df['PVALUE'].transform(sum) pol_df['func_si'] = po_grouped_df['SI'].transform(lambda x: np.sqrt(np.sum(x * np.exp(x-1)))) pol_df['sum_sc'] = po_grouped_df['SC'].transform(sum) pol_df['sum_ee'] = po_grouped_df['EE'].transform(sum) pol_df 

Results in:

PID EID PCODE   PVALUE  SI  SC  EE      sum_pval    func_si         sum_sc  sum_ee 1   123 GU      100     400 230 10000   250         8.765549e+87    443     12000 1   123 GR      50      40  23  10000   350         1.805222e+31    236     40000 2   123 GU      150     140 213 2000    250         8.765549e+87    443     12000 2   123 GR      300     140 213 30000   350         1.805222e+31    236     40000 

For more info, check out this SO answer.

like image 36
vyang Avatar answered Sep 21 '22 08:09

vyang