
Pandas groupby multiple fields then diff

So my dataframe looks like this:

             date    site country  score
    0  2018-01-01  google      us    100
    1  2018-01-01  google      ch     50
    2  2018-01-02  google      us     70
    3  2018-01-03  google      us     60
    4  2018-01-02  google      ch     10
    5  2018-01-01      fb      us     50
    6  2018-01-02      fb      us     55
    7  2018-01-03      fb      us    100
    8  2018-01-01      fb      es    100
    9  2018-01-02      fb      gb    100

Each site has a different score depending on the country. I'm trying to find the 1/3/5-day difference of scores for each site/country combination.

Output should be:

             date    site country  score  diff
    8  2018-01-01      fb      es    100   0.0
    9  2018-01-02      fb      gb    100   0.0
    5  2018-01-01      fb      us     50   0.0
    6  2018-01-02      fb      us     55   5.0
    7  2018-01-03      fb      us    100  45.0
    1  2018-01-01  google      ch     50   0.0
    4  2018-01-02  google      ch     10 -40.0
    0  2018-01-01  google      us    100   0.0
    2  2018-01-02  google      us     70 -30.0
    3  2018-01-03  google      us     60 -10.0

I first tried sorting by site/country/date and then grouping by site and country, but I can't work out how to get a difference from a grouped object.

Craig asked Jan 19 '18 18:01



1 Answer

First, sort the DataFrame and then all you need is groupby.diff():

    df = df.sort_values(by=['site', 'country', 'date'])

    df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)

    df
    Out:
             date    site country  score  diff
    8  2018-01-01      fb      es    100   0.0
    9  2018-01-02      fb      gb    100   0.0
    5  2018-01-01      fb      us     50   0.0
    6  2018-01-02      fb      us     55   5.0
    7  2018-01-03      fb      us    100  45.0
    1  2018-01-01  google      ch     50   0.0
    4  2018-01-02  google      ch     10 -40.0
    0  2018-01-01  google      us    100   0.0
    2  2018-01-02  google      us     70 -30.0
    3  2018-01-03  google      us     60 -10.0
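For the 1/3/5-day part of the question, one possible extension is to pass `periods` to `groupby.diff()`. The sketch below rebuilds the sample frame so it runs on its own. The column names `diff_1`, `diff_3` and `diff_5` are made up for illustration, and it assumes each site/country group has exactly one row per day with no gaps, because `periods` counts rows, not calendar days.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-02",
             "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-01", "2018-01-02"],
    "site": ["google"] * 5 + ["fb"] * 5,
    "country": ["us", "ch", "us", "us", "ch", "us", "us", "us", "es", "gb"],
    "score": [100, 50, 70, 60, 10, 50, 55, 100, 100, 100],
})
df["date"] = pd.to_datetime(df["date"])

# Sort so that rows within each group are in date order.
df = df.sort_values(by=["site", "country", "date"])

# One column per lag; column names are hypothetical.
# periods=n compares each row with the row n positions earlier in its group.
for n in (1, 3, 5):
    df[f"diff_{n}"] = (
        df.groupby(["site", "country"])["score"].diff(periods=n).fillna(0)
    )

print(df)
```

With this sample no group has more than three rows, so `diff_3` and `diff_5` come out as all zeros after `fillna(0)`. If dates can be missing within a group, a row-based `diff` would no longer correspond to a day-based difference and the data would need to be aligned to a daily index first.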

sort_values doesn't support arbitrary orderings: string columns are sorted alphabetically. If you need a custom order (google before fb, for example), store the values in a list in the order you want and convert the column to an ordered categorical. sort_values will then respect the ordering you provided.
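A minimal sketch of the categorical approach, using a small made-up frame and a hypothetical `site_order` list (neither comes from the question):

```python
import pandas as pd

# Small illustrative frame; the values are made up for this example.
df = pd.DataFrame({
    "site": ["fb", "google", "fb", "google"],
    "country": ["us", "us", "es", "ch"],
    "score": [50, 100, 100, 50],
})

# Desired order: google before fb (alphabetical sorting would put fb first).
site_order = ["google", "fb"]
df["site"] = pd.Categorical(df["site"], categories=site_order, ordered=True)

# sort_values now follows the category order for 'site',
# and the usual alphabetical order for 'country'.
df = df.sort_values(by=["site", "country"])
print(df)
```

Any value of `site` that is not listed in `site_order` becomes NaN when the column is converted, so the list should cover every site in the data.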

ayhan answered Sep 20 '22 23:09