Python Pandas Calculate average days between dates

Working with the following python pandas dataframe df:

Customer_ID | Transaction_ID
ABC            2016-05-06-1234
ABC            2017-06-08-3456
ABC            2017-07-12-5678
ABC            2017-12-20-6789
BCD            2016-08-23-7891
BCD            2016-09-21-2345
BCD            2017-10-23-4567

Unfortunately, the date is embedded in the Transaction_ID string, so I extracted it like this:

# year of transaction
df['year'] = df['Transaction_ID'].astype(str).str[:4]

# date of transaction
df['date'] = df['Transaction_ID'].astype(str).str[:10]

# format date
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# calculate visit number per year
df['visit_nr_yr'] = df.groupby(['Customer_ID', 'year']).cumcount() + 1

Now the df looks like this:

Customer_ID | Transaction_ID    | year  | date        |visit_nr_yr 
ABC            2016-05-06-1234    2016    2016-05-06    1            
ABC            2017-06-08-3456    2017    2017-06-08    1            
ABC            2017-07-12-5678    2017    2017-07-12    2            
ABC            2017-12-20-6789    2017    2017-12-20    3            
BCD            2016-08-23-7891    2016    2016-08-23    1            
BCD            2016-09-21-2345    2016    2016-09-21    2            
BCD            2017-10-23-4567    2017    2017-10-23    1            

I need to calculate the following:

  • What is the average number of days between visits, by visit number (that is, between visits 1 and 2, and between visits 2 and 3)?
  • What is the average number of days between visits in general?

First, I would like to add the following column, "days_bw_visits_yr" (days between visits within a year, computed per Customer_ID):

Customer_ID|Transaction_ID  |year| date       |visit_nr_yr|days_bw_visits_yr 
ABC         2016-05-06-1234  2016  2016-05-06   1             NaN
ABC         2017-06-08-3456  2017  2017-06-08   1             NaN
ABC         2017-07-12-5678  2017  2017-07-12   2             34
ABC         2017-12-20-6789  2017  2017-12-20   3             161
BCD         2016-08-23-7891  2016  2016-08-23   1             NaN
BCD         2016-09-21-2345  2016  2016-09-21   2             29
BCD         2017-10-23-4567  2017  2017-10-23   1             NaN

Please note that I avoided 0s on purpose and kept the NaNs, in case somebody had two visits on the same day.
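
A small sketch of why this distinction matters, assuming the gaps are later averaged with pandas' mean(), which skips NaN by default. The series `gaps` below is made-up illustration data, not taken from the dataframe above.

```python
import numpy as np
import pandas as pd

# Hypothetical gaps for one customer: a first visit (no gap, so NaN),
# a 34-day gap, and a repeat visit on the same day (a genuine 0).
gaps = pd.Series([np.nan, 34, 0])

# mean() skips NaN by default, so only the real gaps 34 and 0 are averaged.
mean_with_nan = gaps.mean()

# If first visits were stored as 0, they would be counted as real gaps.
mean_with_zero = gaps.fillna(0).mean()

print(mean_with_nan)   # 17.0
print(mean_with_zero)  # 34 / 3, roughly 11.33
```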

Next I want to calculate the average days between visits by visit (so between 1&2 and between 2&3 within a year). Looking for this output:

avg_days_bw_visits_1_2 | avg_days_bw_visits_2_3
31.5                     161

Finally, I want to calculate the average days between visits in general:

Output: 203.8
# the days between visits are 398, 34, 161, 29, 397 and the average of those numbers is 203.8

I'm stuck on how to create the column "days_bw_visits_yr". NaNs have to be excluded from the math.

jeangelj asked Jul 21 '17 15:07



2 Answers

You can get the previous visit date (grouped by customer and year) by shifting the "date" column down by 1:

df['previous_visit'] = df.groupby(['Customer_ID', 'year'])['date'].shift()

From this, days between visits is simply the difference:

df['days_bw_visits'] = df['date'] - df['previous_visit']

To calculate the mean, convert the timedelta values to whole days:

df['days_bw_visits'] = df['days_bw_visits'].dt.days

Average days between visits, first by visit number and then across all within-year gaps:

df.groupby('visit_nr_yr')['days_bw_visits'].agg('mean')

df['days_bw_visits'].mean()
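
Because the shift above is grouped by customer and year, the overall mean only covers the within-year gaps (34, 161 and 29). The question's expected 203.8 also counts gaps that cross a year boundary. A minimal sketch of that variant, assuming the rows can be sorted by customer and date; the column name days_bw_visits_all is made up for this example:

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "Customer_ID": ["ABC", "ABC", "ABC", "ABC", "BCD", "BCD", "BCD"],
    "Transaction_ID": [
        "2016-05-06-1234", "2017-06-08-3456", "2017-07-12-5678",
        "2017-12-20-6789", "2016-08-23-7891", "2016-09-21-2345",
        "2017-10-23-4567",
    ],
})
df["date"] = pd.to_datetime(df["Transaction_ID"].str[:10], format="%Y-%m-%d")

# Sort so that diff() compares each visit with the customer's previous one.
df = df.sort_values(["Customer_ID", "date"])

# Gap to the previous visit of the same customer, regardless of year.
# "days_bw_visits_all" is a hypothetical column name.
df["days_bw_visits_all"] = df.groupby("Customer_ID")["date"].diff().dt.days

overall_mean = df["days_bw_visits_all"].mean()  # NaN rows are skipped
print(df["days_bw_visits_all"].dropna().tolist())  # [398.0, 34.0, 161.0, 29.0, 397.0]
print(round(overall_mean, 1))  # 203.8
```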
parasu answered Oct 13 '22 10:10


Source DF:

In [96]: df
Out[96]:
  Customer_ID   Transaction_ID
0         ABC  2016-05-06-1234
1         ABC  2017-06-08-3456
2         ABC  2017-07-12-5678
3         ABC  2017-12-20-6789
4         BCD  2016-08-23-7891
5         BCD  2016-09-21-2345
6         BCD  2017-10-23-4567

Solution:

df['Date'] = pd.to_datetime(df.Transaction_ID.str[:10])
df['visit_nr_yr'] = df.groupby(['Customer_ID', df['Date'].dt.year]).cumcount()+1
df['days_bw_visits_yr'] = \
    df.groupby(['Customer_ID', df['Date'].dt.year])['Date'].diff().dt.days

Result:

In [98]: df
Out[98]:
  Customer_ID   Transaction_ID       Date  visit_nr_yr  days_bw_visits_yr
0         ABC  2016-05-06-1234 2016-05-06            1                NaN
1         ABC  2017-06-08-3456 2017-06-08            1                NaN
2         ABC  2017-07-12-5678 2017-07-12            2               34.0
3         ABC  2017-12-20-6789 2017-12-20            3              161.0
4         BCD  2016-08-23-7891 2016-08-23            1                NaN
5         BCD  2016-09-21-2345 2016-09-21            2               29.0
6         BCD  2017-10-23-4567 2017-10-23            1                NaN
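
From that column, the per-visit averages the question asks for could be obtained roughly as follows. This is only a sketch: it assumes the rows are already ordered by customer and date, as in the sample, and the avg_days_bw_visits_* labels are built here to mirror the question's desired output.

```python
import pandas as pd

# Rebuild the sample data and the columns from the solution above.
df = pd.DataFrame({
    "Customer_ID": ["ABC", "ABC", "ABC", "ABC", "BCD", "BCD", "BCD"],
    "Transaction_ID": [
        "2016-05-06-1234", "2017-06-08-3456", "2017-07-12-5678",
        "2017-12-20-6789", "2016-08-23-7891", "2016-09-21-2345",
        "2017-10-23-4567",
    ],
})
df["Date"] = pd.to_datetime(df["Transaction_ID"].str[:10])
year = df["Date"].dt.year.rename("year")
df["visit_nr_yr"] = df.groupby(["Customer_ID", year]).cumcount() + 1
df["days_bw_visits_yr"] = df.groupby(["Customer_ID", year])["Date"].diff().dt.days

# Mean gap per visit number; visit 1 has no previous visit, so it is all NaN
# and gets dropped.
avg_by_visit = df.groupby("visit_nr_yr")["days_bw_visits_yr"].mean().dropna()

# Label each entry by the pair of visits it spans, e.g. 2 -> "..._1_2".
avg_by_visit.index = [f"avg_days_bw_visits_{n - 1}_{n}" for n in avg_by_visit.index]

# One-row frame: avg_days_bw_visits_1_2 = 31.5, avg_days_bw_visits_2_3 = 161.0
print(avg_by_visit.to_frame().T)
```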
MaxU - stop WAR against UA answered Oct 13 '22 11:10