I would like to understand the most compact way to replicate the following Stata command in Python 2.7 using pandas:
egen yr_id_sum = total(var_to_sum), missing by(id year)
I'd like to produce the yr_id_sum column in this table:
id year value yr_id_sum
1 2010 1 3
1 2010 2 3
1 2011 3 7
1 2011 4 7
2 2010 11 23
2 2010 12 23
2 2011 13 27
2 2011 14 27
I can do this for one grouping variable as follows (this may help clarify what I'm trying to do):
def add_sum(grp):
    # Attach the group's total to every row of the group.
    grp['ann_sum'] = grp['var_to_sum'].sum()
    return grp
df = df.groupby('year').apply(add_sum)
This is equivalent to egen year_sum = total(var_to_sum), missing by(year).
I'm having difficulty extending answers like this one, about using sums with a MultiIndex, to my case.
df.set_index(['year', 'id'], inplace=True)
df = df.groupby(['year', 'id']).apply(add_sum)
Seems like it should do what I want it to... but I get Exception: cannot handle a non-unique multi-index!
Here are some of the answers that I've already looked at:
Use DataFrame.groupby().sum() to group rows by one or more columns and sum each group. groupby() returns a DataFrameGroupBy object, whose sum() method computes the sum of each column within each group.
To sum selected columns of a pandas DataFrame, you can use sum(), iloc[], eval(), or loc[]. DataFrame.sum() returns the sum of the values along the requested axis; use axis=1 to sum across columns, which gives one total per row.
To create a new column from the output of groupby().sum(), first apply the groupby().sum() operation and then store the result in a new column.
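As a rough sketch of that last idea (compute the group sums first, then store them in a new column), one option is to merge the aggregated result back onto the original rows. The sample frame below just mirrors the table in the question, and the names sums and merged are made up for illustration:

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "year":  [2010, 2010, 2011, 2011, 2010, 2010, 2011, 2011],
    "value": [1, 2, 3, 4, 11, 12, 13, 14],
})

# One row per (id, year) group, holding that group's sum.
sums = (
    df.groupby(["id", "year"], as_index=False)["value"]
      .sum()
      .rename(columns={"value": "yr_id_sum"})
)

# Left-merge the group sums back onto the original rows.
merged = df.merge(sums, on=["id", "year"], how="left")
print(merged)
```

This takes two steps and builds an intermediate frame, so it is less compact than a one-line approach, but it makes the "aggregate, then attach" logic explicit.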
To reproduce your desired output, you could use transform: it takes the results of a groupby operation and broadcasts them back up to the original index. For example:
>>> df["yr_id_sum"] = df.groupby(["id", "year"])["value"].transform(sum)
>>> df
   id  year  value  yr_id_sum
0   1  2010      1          3
1   1  2010      2          3
2   1  2011      3          7
3   1  2011      4          7
4   2  2010     11         23
5   2  2010     12         23
6   2  2011     13         27
7   2  2011     14         27
which is basically
>>> df.groupby(["id", "year"])["value"].sum()
id  year
1   2010     3
    2011     7
2   2010    23
    2011    27
Name: value, dtype: int64
but with each group's sum repeated for every row in that group, so the result lines up with the original index.
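One caveat about the missing option in the Stata command: as far as I know, it makes total() return missing when every value in the group is missing, whereas pandas' sum gives 0 for an all-NaN group by default. A minimal sketch of one way to mimic that, assuming a pandas version that supports the min_count argument of sum (0.22 or later, I believe); the sample data and the column names sum_default and sum_missing are made up for illustration:

```python
import numpy as np
import pandas as pd

# Group (1, 2010) has one real value; group (2, 2010) is entirely missing.
df = pd.DataFrame({
    "id":    [1, 1, 2, 2],
    "year":  [2010, 2010, 2010, 2010],
    "value": [1.0, np.nan, np.nan, np.nan],
})

grouped = df.groupby(["id", "year"])["value"]

# Default behaviour: an all-NaN group sums to 0.
df["sum_default"] = grouped.transform("sum")

# min_count=1 requires at least one non-missing value,
# so an all-NaN group yields NaN instead of 0.
df["sum_missing"] = grouped.transform("sum", min_count=1)
print(df)
```

Passing the string "sum" rather than the builtin sum lets the extra keyword be forwarded to the groupby sum method.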