Sum by year and id in Pandas

Tags: python, pandas

I would like to understand the most compact way to replicate the following Stata command in Python 2.7 using pandas: egen yr_id_sum = total(var_to_sum), missing by(id year).

I'd like to produce the yr_id_sum column in this table:

id    year    value  yr_id_sum
1     2010    1      3
1     2010    2      3
1     2011    3      7
1     2011    4      7
2     2010    11     23
2     2010    12     23
2     2011    13     27
2     2011    14     27

I can do this for one grouping variable as follows (this may help clarify what I'm trying to do):

def add_sum(grp):
    # Attach the group total to every row of the group
    grp['ann_sum'] = grp['var_to_sum'].sum()
    return grp

df = df.groupby('year').apply(add_sum)

This is equivalent to egen year_sum = total(var_to_sum), missing by(year).

I'm having difficulty with expanding answers like this about using sums with a multiindex to my case.

df.set_index(['year', 'id'], inplace=True)
df = df.groupby(['year', 'id']).apply(add_sum)

Seems like it should do what I want it to... but I get Exception: cannot handle a non-unique multi-index!

Here are some of the answers that I've already looked at:

This question about applying a user defined function to each subgroup of a Group By in Pandas is close to what I am looking for.
I am trying to follow this question, with an unconditional sum.

asked Feb 11 '16 00:02

Arthur Morris

1 Answer

To reproduce your desired output, you could use transform: it takes the result of a groupby operation and broadcasts it back to the original index. For example:

>>> df["yr_id_sum"] = df.groupby(["id", "year"])["value"].transform(sum)
>>> df
   id  year  value  yr_id_sum
0   1  2010      1          3
1   1  2010      2          3
2   1  2011      3          7
3   1  2011      4          7
4   2  2010     11         23
5   2  2010     12         23
6   2  2011     13         27
7   2  2011     14         27

which is basically

>>> df.groupby(["id", "year"])["value"].sum()
id  year
1   2010     3
    2011     7
2   2010    23
    2011    27
Name: value, dtype: int64

but with each group's sum repeated for every row in that group, so the result lines up with the original index.
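For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the table from the question and adds one hypothetical group (id 3, year 2010) whose values are all missing. That extra group is my own addition, there to illustrate the missing option of the Stata command, which as I understand it returns missing when every value in a group is missing. The column name yr_id_sum_missing is also just for illustration, and min_count assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Data from the question, plus a hypothetical all-missing group (id 3, year 2010)
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "year":  [2010, 2010, 2011, 2011, 2010, 2010, 2011, 2011, 2010, 2010],
    "value": [1, 2, 3, 4, 11, 12, 13, 14, np.nan, np.nan],
})

# Plain group sum broadcast back to every row; NaN is skipped,
# so the all-missing group ends up with a sum of 0.
df["yr_id_sum"] = df.groupby(["id", "year"])["value"].transform("sum")

# min_count=1 requires at least one non-missing value per group;
# groups with none get NaN instead of 0.
df["yr_id_sum_missing"] = (
    df.groupby(["id", "year"])["value"]
      .transform(lambda s: s.sum(min_count=1))
)

print(df)
```

The two columns agree for every group that has at least one non-missing value, and differ only for the all-missing group. Note that no set_index call is needed: grouping directly on the id and year columns avoids the non-unique multi-index problem from the question.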

answered Oct 21 '22 11:10

DSM
