Pandas percentage change using group by

Suppose I have the following DataFrame:

df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'], 
                   'year': [2013, 2014, 2016, 2015, 2016, 2013, 2016, 2017, 2018],
                  'value': [10, 12, 16, 20, 21, 11, 15, 13, 16]})

And I want to find, for each city and year, what was the percentage change of value compared to the year before. My final dataframe would be:

city  year  value
   a  2013    NaN
   a  2014   0.20
   a  2016    NaN
   b  2015    NaN
   b  2016   0.05
   c  2013    NaN
   d  2016    NaN
   d  2017  -0.14
   d  2018   0.23

I tried to group by city and then use apply, but it didn't work:

df.groupby('city').apply(lambda x: x.sort_values('year')['value'].pct_change()).reset_index()

It didn't work because I couldn't keep the year column, and also because this approach assumes every city has data for every year, which is not true.

EDIT: I'm not very concerned with efficiency, so any solution that solves the problem is valid for me.
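To make the pitfall concrete, here is a small runnable sketch of why a plain per-city pct_change is not enough: it happily computes a change for a-2016 against a-2014, even though 2015 is missing for city a.

```python
import pandas as pd

# The question's example frame
df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'year': [2013, 2014, 2016, 2015, 2016, 2013, 2016, 2017, 2018],
                   'value': [10, 12, 16, 20, 21, 11, 15, 13, 16]})

# Naive per-city pct_change ignores gaps between years:
naive = df.sort_values('year').groupby('city')['value'].pct_change()
print(naive)
# Row 2 (city a, year 2016) gets (16 - 12) / 12, i.e. a value,
# even though the desired result is NaN because 2015 is missing.
```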

asked May 20 '21 by Bruno Mello

2 Answers

Let's try a lazy groupby(): use pct_change for the changes and diff to detect year jumps:

groups = df.sort_values('year').groupby(['city'])

df['pct_chg'] = (groups['value'].pct_change()
                    .where(groups['year'].diff()==1)
                )

Output:

  city  year  value   pct_chg
0    a  2013     10       NaN
1    a  2014     12  0.200000
2    a  2016     16       NaN
3    b  2015     20       NaN
4    b  2016     21  0.050000
5    c  2013     11       NaN
6    d  2016     15       NaN
7    d  2017     13 -0.133333
8    d  2018     16  0.230769
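For reference, the whole recipe runs end to end as below (a self-contained sketch, rebuilding the question's frame so the snippet stands alone):

```python
import pandas as pd

# The question's example frame
df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'year': [2013, 2014, 2016, 2015, 2016, 2013, 2016, 2017, 2018],
                   'value': [10, 12, 16, 20, 21, 11, 15, 13, 16]})

groups = df.sort_values('year').groupby('city')

# Per-city pct_change, kept only where the year gap is exactly 1;
# .where() replaces every other position with NaN
df['pct_chg'] = groups['value'].pct_change().where(groups['year'].diff() == 1)
print(df)
```

The key design choice is that the mask (`groups['year'].diff() == 1`) and the changes come from the same sorted groupby, so their indexes align and `.where()` can combine them directly.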
answered Nov 03 '22 by Quang Hoang

Although @Quang's answer is more elegant and concise, I'll add another approach using indexing.

sorted_df = df.sort_values(by=['city', 'year'])
sorted_df.loc[((sorted_df.year.diff() == 1) & 
              (sorted_df.city == sorted_df.city.shift(1))), 'pct_chg'] = sorted_df.value.pct_change()

My approach is faster, as the timings below (run on your df) show, but the syntax is not as pretty.

%timeit #mine
1.44 ms ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit #@Quang's
2.23 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
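The indexing approach above can also be verified end to end with a self-contained sketch (same input frame as the question):

```python
import pandas as pd

# The question's example frame
df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'year': [2013, 2014, 2016, 2015, 2016, 2013, 2016, 2017, 2018],
                   'value': [10, 12, 16, 20, 21, 11, 15, 13, 16]})

sorted_df = df.sort_values(by=['city', 'year'])

# Keep pct_change only where the previous row is the same city
# AND the previous year is exactly one less; all other rows stay NaN
mask = (sorted_df.year.diff() == 1) & (sorted_df.city == sorted_df.city.shift(1))
sorted_df.loc[mask, 'pct_chg'] = sorted_df.value.pct_change()
print(sorted_df)
```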
answered Nov 03 '22 by Sina Meftah