How to sum in pandas by unique index in several columns?

Tags: python, pandas, aggregate, sum

I have a pandas DataFrame that records online activity as "clicks" during a user session. There are as many as 50,000 unique users, and the dataframe has around 1.5 million rows, so most users have multiple records.

The four columns are a unique user ID ("User_ID"), the date the user began the service ("Registration"), the date the user used the service ("Session"), and the total number of clicks ("clicks").

The organization of the dataframe is as follows:

User_ID    Registration  Session      clicks
2349876    2012-02-22    2014-04-24   2 
1987293    2011-02-01    2013-05-03   1 
2234214    2012-07-22    2014-01-22   7 
9874452    2010-12-22    2014-08-22   2 
...

(There is also an index above beginning with 0, but one could set User_ID as the index.)

I would like to aggregate the total number of clicks per user since the Registration date. The resulting dataframe (or pandas Series) would list User_ID and "Total_Clicks":

User_ID    Total_Clicks
2349876    722 
1987293    341
2234214    220 
9874452    1405 
...

How does one do this in pandas? Is this done by .agg()? Each User_ID needs to be summed individually.

As there are 1.5 million records, does this scale?


asked Feb 10 '16 05:02

ShanZhengYang

1 Answer

If I understand correctly, you can use groupby, sum and reset_index:

print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df.groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
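
For anyone who wants to run this end to end, here is a minimal self-contained sketch for Python 3. The fifth row (a second session for user 2349876) is a hypothetical addition, not part of the question's data, so that at least one user has more than one record to sum. The Total_Clicks column name follows the output layout requested in the question.

```python
import pandas as pd

# Sample rows from the question, plus one hypothetical extra session
# for user 2349876 so the sum has something to add up.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452, 2349876],
    "Registration": pd.to_datetime(
        ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22", "2012-02-22"]
    ),
    "Session": pd.to_datetime(
        ["2014-04-24", "2013-05-03", "2014-01-22", "2014-08-22", "2014-05-01"]
    ),
    "clicks": [2, 1, 7, 2, 5],
})

# Sum clicks per user, name the result, and turn User_ID back into a column.
total_clicks = (
    df.groupby("User_ID")["clicks"]
      .sum()
      .rename("Total_Clicks")
      .reset_index()
)
print(total_clicks)

# The same aggregation spelled with .agg() (named aggregation,
# available in pandas 0.25 and later).
via_agg = (
    df.groupby("User_ID")
      .agg(Total_Clicks=("clicks", "sum"))
      .reset_index()
)
print(via_agg)
```

Both spellings give the same totals, so for a single sum the choice between .sum() and .agg() is mostly a matter of taste; .agg() becomes more useful once several aggregations are needed at once.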

If the first column, User_ID, is the index:

print(df)
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2014-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.groupby(level=0)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

Or:

print(df.groupby(df.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
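
On the question of scale: a grouped sum is a vectorised operation in pandas, so a frame of about 1.5 million rows and 50,000 users should normally be well within what it handles in memory. Below is a rough sketch for trying this locally. The data is synthetic, and the ID range, click values and random seed are all made up for illustration. It also checks that the two index-based spellings above agree.

```python
import numpy as np
import pandas as pd

# Synthetic data sized like the question's frame (all values are made up).
rng = np.random.default_rng(0)
n_rows, n_users = 1_500_000, 50_000

big = pd.DataFrame({
    "User_ID": rng.integers(1_000_000, 1_000_000 + n_users, size=n_rows),
    "clicks": rng.integers(1, 10, size=n_rows),
}).set_index("User_ID")

# Two equivalent ways of grouping on the index.
by_level = big.groupby(level=0)["clicks"].sum()
by_index = big.groupby(big.index)["clicks"].sum()

print(len(by_level))                # number of distinct users found
print(by_level.equals(by_index))    # whether the two spellings agree
```

Wrapping the groupby line in a timer such as time.perf_counter() gives a quick feel for the cost on a particular machine.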

EDIT:

As Alexander pointed out, you need to filter the data before the groupby if any Session date is earlier than the Registration date for that User_ID:

print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

I changed the third row of the data to give a better sample:

print(df)
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2012-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.Session >= df.Registration)
User_ID
2349876     True
1987293     True
2234214    False
9874452     True
dtype: bool

print(df[df.Session >= df.Registration])
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
9874452   2010-12-22 2014-08-22       2

df1 = df[df.Session >= df.Registration]
print(df1.groupby(df1.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2349876       2
2  9874452       2
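
The filtering step above can be sketched as a self-contained Python 3 script. It assumes the date columns arrive as strings and converts them with pd.to_datetime so the comparison is a real date comparison. The last step, which puts users with no valid sessions back with a total of 0, is an optional extra that is not part of the original answer.

```python
import pandas as pd

# Sample data with the modified third row (Session before Registration).
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"],
    "Session": ["2014-04-24", "2013-05-03", "2012-01-22", "2014-08-22"],
    "clicks": [2, 1, 7, 2],
})

# Convert the date strings to datetime64 so >= compares dates.
for col in ("Registration", "Session"):
    df[col] = pd.to_datetime(df[col])

# Keep only sessions on or after registration, then sum per user.
valid = df[df["Session"] >= df["Registration"]]
totals = valid.groupby("User_ID")["clicks"].sum()
print(totals)

# Optional: users whose rows were all filtered out disappear from the
# result; reindexing on every known user restores them with 0 clicks.
totals_all = totals.reindex(df["User_ID"].unique(), fill_value=0)
print(totals_all)
```

Whether filtered-out users should appear with 0 or be dropped depends on what the totals are used for; the groupby itself only ever reports users that still have rows after filtering.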


answered Sep 18 '22 16:09

jezrael
