I have the following pandas DataFrame containing 2 columns (simplified). The first column contains player names and the second column contains dates (datetime objects):
player date
A 2010-01-01
A 2010-01-09
A 2010-01-11
A 2010-01-15
B 2010-02-01
B 2010-02-10
B 2010-02-21
B 2010-02-23
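For reproducibility, the sample data above could be built roughly like this. This is a minimal sketch; the variable name df and the use of pd.to_datetime to obtain datetime values are assumptions, not something stated in the question.

```python
import pandas as pd

# Sample data from the question; `df` is an assumed variable name
df = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-09", "2010-01-11", "2010-01-15",
        "2010-02-01", "2010-02-10", "2010-02-21", "2010-02-23",
    ]),
})
```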
I want to add a column diff which represents the time difference in days per player. The result should look like this:
player date diff
A 2010-01-01 0
A 2010-01-09 8
A 2010-01-11 2
A 2010-01-15 4
B 2010-02-01 0
B 2010-02-10 9
B 2010-02-21 11
B 2010-02-23 2
The first row has 0 for diff, because there is no earlier date. The second row shows 8, because the difference between 2010-01-01 and 2010-01-09 is eight days.
The problem is not calculating the day difference between two datetime objects. I am just not sure how to add the new column. I know that I have to do a groupby first (df.groupby('player')) and then use apply (or maybe transform?). However, I am stuck because, to calculate the difference, I need to refer to the previous row inside the apply function, and I don't know how to do that, if it is possible at all.
Thank you very much.
UPDATE:
After trying both proposed solutions below, I found that they did not work with my code. After much headache, I discovered that my data had duplicate indices. A simple df.reset_index() solved my issue, and both proposed solutions then worked. Since both solutions work but I can only mark one as correct, I will choose the more concise one. Thanks to both of you, though!
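The duplicate-index situation described in the update can be checked for and repaired along these lines. This is only a sketch on a made-up three-row frame; the duplicated labels are hypothetical, and drop=True is a choice made here to discard the old index rather than keep it as a column.

```python
import pandas as pd

# Hypothetical frame whose index label 0 appears twice
df = pd.DataFrame(
    {
        "player": ["A", "A", "B"],
        "date": pd.to_datetime(["2010-01-01", "2010-01-09", "2010-02-01"]),
    },
    index=[0, 0, 1],
)

# True when at least one index label is repeated
has_dupes = df.index.has_duplicates

# reset_index returns a new frame, so the result must be assigned;
# drop=True throws the old index away instead of adding it as a column
fixed = df.reset_index(drop=True)
```

Note that reset_index does not modify the frame in place, so the result has to be assigned back.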
During data analysis, you may need the difference between consecutive rows. pandas provides the DataFrame.diff() function for this. Its periods parameter sets how many rows back to compare against (the default is 1).
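As a small illustration of the periods parameter, here is a sketch on a made-up numeric Series (the values are not from the question):

```python
import pandas as pd

# Made-up values, purely for illustration
s = pd.Series([1, 4, 9, 16])

# Difference to the immediately preceding row (periods=1 is the default)
step1 = s.diff()

# Difference to the row two positions earlier
step2 = s.diff(periods=2)
```

Rows without a predecessor at the requested distance come back as NaN.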
You can simply write:
df['difference'] = df.groupby('player')['date'].diff().fillna(pd.Timedelta(0))
This gives the new timedelta column with the correct values:
player date difference
0 A 2010-01-01 0 days
1 A 2010-01-09 8 days
2 A 2010-01-11 2 days
3 A 2010-01-15 4 days
4 B 2010-02-01 0 days
5 B 2010-02-10 9 days
6 B 2010-02-21 11 days
7 B 2010-02-23 2 days
(I've used the name "difference" instead of "diff" to distinguish the column name from the method diff.)
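The question asked for plain day counts rather than timedeltas. One possible way to get them, sketched here on the sample data, is the .dt.days accessor followed by filling the missing first row of each group; the column name diff follows the question, and the cast to int is an assumption about the desired dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-09", "2010-01-11", "2010-01-15",
        "2010-02-01", "2010-02-10", "2010-02-21", "2010-02-23",
    ]),
})

# Per-player difference to the previous date, as timedeltas (NaT for the first row)
delta = df.groupby("player")["date"].diff()

# .dt.days yields floats with NaN for the first row of each group;
# fill those with 0 and cast to integers
df["diff"] = delta.dt.days.fillna(0).astype(int)
```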
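Another way, if you want to implement it manually, is the following:
def date_diff(df):
    # Subtract the previous row's date within this group
    df['difference'] = df['date'] - df['date'].shift()
    # The first row of each group has no predecessor; use a zero timedelta
    df['difference'] = df['difference'].fillna(pd.Timedelta(0))
    return df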
In [30]:
df_final = df.groupby(df['player']).apply(date_diff)
df_final
Out[30]:
player date difference
A 2010-01-01 0 days
A 2010-01-09 8 days
A 2010-01-11 2 days
A 2010-01-15 4 days
B 2010-02-01 0 days
B 2010-02-10 9 days
B 2010-02-21 11 days
B 2010-02-23 2 days
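The same shift-based idea can also be written without apply, by shifting within each group directly. This is a sketch of an alternative, not the answerer's code, and it assumes the rows are already sorted by date within each player.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-09", "2010-01-11", "2010-01-15",
        "2010-02-01", "2010-02-10", "2010-02-21", "2010-02-23",
    ]),
})

# Previous date of the same player (NaT for each player's first row)
previous = df.groupby("player")["date"].shift()

# Row-wise subtraction, then a zero timedelta where there was no previous date
df["difference"] = (df["date"] - previous).fillna(pd.Timedelta(0))
```

Because groupby(...).shift() returns a Series aligned with the original index, the result can be assigned directly as a new column.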