Suppose I have the following DataFrame:
import pandas as pd

df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'year': [2013, 2014, 2016, 2015, 2016, 2013, 2016, 2017, 2018],
                   'value': [10, 12, 16, 20, 21, 11, 15, 13, 16]})
And I want to find, for each city and year, the percentage change of value compared to the previous year. My final DataFrame would be:
city year value
a 2013 NaN
a 2014 0.20
a 2016 NaN
b 2015 NaN
b 2016 0.05
c 2013 NaN
d 2016 NaN
d 2017 -0.14
d 2018 0.23
I tried to group by city and then use apply, but it didn't work:
df.groupby('city').apply(lambda x: x.sort_values('year')['value'].pct_change()).reset_index()
It didn't work because I couldn't keep the year, and also because this way I was assuming I had all years for all cities, which is not true.
EDIT: I'm not very concerned with efficiency, so any solution that solves the problem is valid for me.
As background, pct_change() calculates the percentage change between the current and a prior element. By default it compares each row with the immediately previous one; which row to compare with can be controlled with the periods parameter. Combined with groupby(), this lets you compute the change within each group.
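For reference, here is a minimal, self-contained sketch of pct_change() on a plain Series (the values are made up just for illustration):

import pandas as pd

s = pd.Series([10, 12, 16])

# Default: compare each element with the immediately previous one.
s.pct_change()           # NaN, 0.20, 0.333...

# periods=2 compares each element with the one two rows back.
s.pct_change(periods=2)  # NaN, NaN, 0.60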
Let's try a lazy groupby(): use pct_change for the changes and diff to detect the year jump:
groups = df.sort_values('year').groupby(['city'])
df['pct_chg'] = (groups['value'].pct_change()
                     .where(groups['year'].diff() == 1))
Output:
city year value pct_chg
0 a 2013 10 NaN
1 a 2014 12 0.200000
2 a 2016 16 NaN
3 b 2015 20 NaN
4 b 2016 21 0.050000
5 c 2013 11 NaN
6 d 2016 15 NaN
7 d 2017 13 -0.133333
8 d 2018 16 0.230769
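To make the one-liner easier to follow, here is the same logic split into named intermediate steps (the variable names raw_pct and year_gap are just for illustration):

groups = df.sort_values('year').groupby(['city'])

raw_pct = groups['value'].pct_change()   # naive change within each city
year_gap = groups['year'].diff()         # gap to the previous year within the city

# keep the change only where the previous row is exactly one year earlier;
# rows that come after a gap (e.g. a 2014 -> a 2016) stay NaN
df['pct_chg'] = raw_pct.where(year_gap == 1)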
Although @Quang's answer is much more elegant and concise, I'll just add another approach using indexing.
sorted_df = df.sort_values(by=['city', 'year'])

# keep the change only where the previous row is the same city and exactly one year earlier
sorted_df.loc[((sorted_df.year.diff() == 1) &
               (sorted_df.city == sorted_df.city.shift(1))), 'pct_chg'] = sorted_df.value.pct_change()
My approach is faster, as you can see below when run on your df, but the syntax is not as pretty.
%timeit #mine
1.44 ms ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit #@Quang's
2.23 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
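If you want to reproduce the comparison, a rough sketch of the setup (the function names are just for illustration; the numbers above are from the original run):

def indexing_approach(df):
    out = df.sort_values(by=['city', 'year']).copy()
    mask = (out.year.diff() == 1) & (out.city == out.city.shift(1))
    out.loc[mask, 'pct_chg'] = out.value.pct_change()
    return out

def groupby_approach(df):
    out = df.copy()
    groups = out.sort_values('year').groupby(['city'])
    out['pct_chg'] = groups['value'].pct_change().where(groups['year'].diff() == 1)
    return out

# in IPython / Jupyter:
# %timeit indexing_approach(df)
# %timeit groupby_approach(df)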