Suppose I have the following DataFrame:
import pandas as pd

df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'year': [2013, 2014, 2016, 2015, 2016, 2013, 2016, 2017, 2018],
                   'value': [10, 12, 16, 20, 21, 11, 15, 13, 16]})
And I want to find, for each city and year, the percentage change of value compared to the previous year. My final DataFrame would be:
city year value
a 2013 NaN
a 2014 0.20
a 2016 NaN
b 2015 NaN
b 2016 0.05
c 2013 NaN
d 2016 NaN
d 2017 -0.14
d 2018 0.23
I tried to group by city and then use apply, but it didn't work:
df.groupby('city').apply(lambda x: x.sort_values('year')['value'].pct_change()).reset_index()
It didn't work because I couldn't keep the year, and also because this way I was assuming I had all years for all cities, which is not true.
EDIT: I'm not very concerned with efficiency, so any solution that solves the problem is valid for me.
As background, pct_change() calculates the percentage change between the current and a prior element. By default it compares each row with the immediately previous one; which row to compare with can be controlled with the periods parameter. Combined with groupby(), this lets you compute the change within each group.
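For reference, here is a minimal, self-contained sketch of pct_change() on a plain Series (the values are made up just for illustration):

import pandas as pd

s = pd.Series([10, 12, 16])

# Default: compare each element with the immediately previous one.
s.pct_change()           # NaN, 0.20, 0.333...

# periods=2 compares each element with the one two rows back.
s.pct_change(periods=2)  # NaN, NaN, 0.60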
Let's try a lazy groupby(): use pct_change for the changes and diff to detect the year jump:
groups = df.sort_values('year').groupby(['city'])
df['pct_chg'] = (groups['value'].pct_change()
                     .where(groups['year'].diff() == 1))
Output:
city year value pct_chg
0 a 2013 10 NaN
1 a 2014 12 0.200000
2 a 2016 16 NaN
3 b 2015 20 NaN
4 b 2016 21 0.050000
5 c 2013 11 NaN
6 d 2016 15 NaN
7 d 2017 13 -0.133333
8 d 2018 16 0.230769
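To make the one-liner easier to follow, here is the same logic split into named intermediate steps (the variable names raw_pct and year_gap are just for illustration):

groups = df.sort_values('year').groupby(['city'])

raw_pct = groups['value'].pct_change()   # naive change within each city
year_gap = groups['year'].diff()         # gap to the previous year within the city

# keep the change only where the previous row is exactly one year earlier;
# rows that come after a gap (e.g. a 2014 -> a 2016) stay NaN
df['pct_chg'] = raw_pct.where(year_gap == 1)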
Although @Quang's answer is much more elegant and concise, I'll just add another approach using indexing.
sorted_df = df.sort_values(by=['city', 'year'])

# keep the change only where the previous row is the same city and exactly one year earlier
sorted_df.loc[((sorted_df.year.diff() == 1) &
               (sorted_df.city == sorted_df.city.shift(1))), 'pct_chg'] = sorted_df.value.pct_change()
My approach is faster, as you can see below when run on your df, but the syntax is not as pretty.
%timeit #mine
1.44 ms ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit #@Quang's
2.23 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
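If you want to reproduce the comparison, a rough sketch of the setup (the function names are just for illustration; the numbers above are from the original run):

def indexing_approach(df):
    out = df.sort_values(by=['city', 'year']).copy()
    mask = (out.year.diff() == 1) & (out.city == out.city.shift(1))
    out.loc[mask, 'pct_chg'] = out.value.pct_change()
    return out

def groupby_approach(df):
    out = df.copy()
    groups = out.sort_values('year').groupby(['city'])
    out['pct_chg'] = groups['value'].pct_change().where(groups['year'].diff() == 1)
    return out

# in IPython / Jupyter:
# %timeit indexing_approach(df)
# %timeit groupby_approach(df)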