I have a pandas dataframe with two columns that contain dates. I want to know the number of years between the two dates while accounting for leap years.
The ideal solution would not assume that a year always has a fixed number of days, because years don't always have 365 days.
Sample data:
date_end date_start
2010-02-09 1933-03-03
2010-03-19 1924-04-08
2010-04-19 1924-04-08
2010-09-06 1924-04-08
2010-09-24 1924-04-08
2010-01-09 1933-04-29
2010-02-26 1933-04-29
2010-01-31 1953-06-10
2010-07-07 1928-11-14
2010-12-01 1974-11-17
date_start and date_end are of the "datetime" dtype. I want a new column that is the number of years between the two dates. It is simple to get the number of days between the two dates (df['diff'] = df.date_end - df.date_start), but then I run into trouble, since the number of years that passed for a given number of days depends on when the days happened, because of leap years.
This is analogous to a person's age. I've tried adapting a number of solutions to similar questions, but many questions are about number of days or weeks between two dates. I already have a way of getting the number of years without accounting for leap years, but I want to be more correct than that.
Assuming you want to define a year as 365 days, then you can do this:
>> df
date_end date_start is_leapyear
0 2016-02-28 2015-02-28 0
1 2017-02-28 2016-02-28 1
2 2018-02-28 2017-02-28 0
>> from datetime import timedelta
>> df['diff_in_days'] = df['date_end'] - df['date_start']
>> df['diff_in_years'] = df["diff_in_days"] / timedelta(days=365)
>> print(df[["date_end", "date_start", "diff_in_years"]])
>> df
date_end date_start is_leapyear diff_in_years
0 2016-02-28 2015-02-28 0 1.00000
1 2017-02-28 2016-02-28 1 1.00274
2 2018-02-28 2017-02-28 0 1.00000
As you can see, when the interval includes a leap day (Feb 29), more time has elapsed between the dates, so the result is slightly more than 1. In your case this would be:
date_end date_start diff_in_years
0 2010-02-09 1933-03-03 76.991781
1 2010-03-19 1924-04-08 86.002740
2 2010-04-19 1924-04-08 86.087671
3 2010-09-06 1924-04-08 86.471233
4 2010-09-24 1924-04-08 86.520548
5 2010-01-09 1933-04-29 76.750685
6 2010-02-26 1933-04-29 76.882192
7 2010-01-31 1953-06-10 56.682192
8 2010-07-07 1928-11-14 81.698630
9 2010-12-01 1974-11-17 36.063014
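For reference, here is a minimal self-contained sketch of the same 365-day calculation in Python 3, using only the first two rows of the question's sample data. The use of pd.Timedelta instead of datetime.timedelta is my choice for the sketch, not something the approach requires.

```python
import pandas as pd

# First two rows of the sample data from the question.
df = pd.DataFrame({
    "date_end": pd.to_datetime(["2010-02-09", "2010-03-19"]),
    "date_start": pd.to_datetime(["1933-03-03", "1924-04-08"]),
})

# Dividing a timedelta column by a fixed 365-day Timedelta yields floats.
df["diff_in_years"] = (df["date_end"] - df["date_start"]) / pd.Timedelta(days=365)

print(df)
```

Because every year is treated as exactly 365 days, each leap day in the interval adds 1/365 to the result.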
On the other hand, if you just want the difference between the calendar years in which the dates fall (regardless of when in the year each date occurs), then you can do this:
df['date_end_year'] = df.date_end.dt.year
df['date_start_year'] = df.date_start.dt.year
df['diff_in_years'] = df['date_end_year'] - df['date_start_year']
print(df[["date_end", "date_start", "diff_in_years"]])
date_end date_start diff_in_years
0 2016-02-28 2015-02-28 1
1 2017-02-28 2016-02-28 1
2 2018-02-28 2017-02-28 1
In your case this would be:
date_end date_start diff_in_years
0 2010-02-09 1933-03-03 77
1 2010-03-19 1924-04-08 86
2 2010-04-19 1924-04-08 86
3 2010-09-06 1924-04-08 86
4 2010-09-24 1924-04-08 86
5 2010-01-09 1933-04-29 77
6 2010-02-26 1933-04-29 77
7 2010-01-31 1953-06-10 57
8 2010-07-07 1928-11-14 82
9 2010-12-01 1974-11-17 36
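Note that plain year subtraction is not the same as a person's age: the first row gives 77, although the 77th anniversary (2010-03-03) had not yet been reached on 2010-02-09. If completed years are what you want, one possible sketch is to subtract one whenever the end date falls before the anniversary in its year. The helper name completed_years is hypothetical, and only three rows of the sample data are used.

```python
import pandas as pd

df = pd.DataFrame({
    "date_start": pd.to_datetime(["1933-03-03", "1924-04-08", "1974-11-17"]),
    "date_end": pd.to_datetime(["2010-02-09", "2010-04-19", "2010-12-01"]),
})

def completed_years(start: pd.Series, end: pd.Series) -> pd.Series:
    """Whole years elapsed between two datetime columns (age-style)."""
    year_diff = end.dt.year - start.dt.year
    # True where the end date comes before the anniversary in the end year.
    before_anniversary = (end.dt.month < start.dt.month) | (
        (end.dt.month == start.dt.month) & (end.dt.day < start.dt.day)
    )
    return year_diff - before_anniversary.astype(int)

df["completed_years"] = completed_years(df["date_start"], df["date_end"])
print(df)  # completed_years column: 76, 86, 36
```

This only compares calendar fields, so it never needs to know how many days a particular year has.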
At first I tried praveen's answer, but found one thing unintuitive: if the start date falls in a leap year and the end date does not, the result is not an integer even when the month and day are the same in both dates, because the start date is divided by 366 and the end date by 365.
My idea is to count the full years from the start date up to the end date. Then take the days between the last full-year anniversary and the end date, and divide them by 365, or by 366 if either that anniversary or the end date falls in a leap year.
import pandas as pd

def num_years(stdt: pd.Timestamp, endt: pd.Timestamp):
    stYr = stdt.year
    stMon = stdt.month
    stDay = stdt.day
    enYr = endt.year
    rv = 0
    # num of full years: move the start date into the end year
    # (note: this raises ValueError if the start date is Feb 29 and the target year is not a leap year)
    calcStdt = pd.Timestamp(year=enYr, month=stMon, day=stDay)
    if calcStdt > endt:
        # anniversary not reached yet in the end year, so use the previous one
        calcStdt = pd.Timestamp(year=enYr - 1, month=stMon, day=stDay)
        rv = enYr - 1 - stYr
    else:
        rv = enYr - stYr
    # remaining period less than a year
    if calcStdt.is_leap_year or endt.is_leap_year:
        rv = rv + (endt - calcStdt).days / 366
    else:
        rv = rv + (endt - calcStdt).days / 365
    return rv
# Test
print(num_years(pd.Timestamp(year=2000, month=1, day=10), pd.Timestamp(year=2004, month=1, day=10)))
print(num_years(pd.Timestamp(year=2000, month=5, day=10), pd.Timestamp(year=2004, month=5, day=10)))
print(num_years(pd.Timestamp(year=2001, month=1, day=10), pd.Timestamp(year=2004, month=1, day=10)))
print(num_years(pd.Timestamp(year=2001, month=5, day=10), pd.Timestamp(year=2004, month=5, day=10)))
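A possible variant, sketched under my own assumptions rather than taken from the answer above: divide the leftover days by the actual number of days between the last anniversary and the next one, and treat Feb 28 as the anniversary of a Feb 29 start date in non-leap years. The names anniversary and fractional_years are hypothetical.

```python
import pandas as pd

def anniversary(start: pd.Timestamp, year: int) -> pd.Timestamp:
    """Start date moved to `year`; a Feb 29 start falls back to Feb 28 (assumption)."""
    try:
        return start.replace(year=year)
    except ValueError:  # Feb 29 in a non-leap year
        return start.replace(year=year, day=28)

def fractional_years(start: pd.Timestamp, end: pd.Timestamp) -> float:
    # Last anniversary on or before the end date.
    last = anniversary(start, end.year)
    if last > end:
        last = anniversary(start, end.year - 1)
    # Length of the anniversary year that contains the end date (365 or 366 days).
    nxt = anniversary(start, last.year + 1)
    return (last.year - start.year) + (end - last).days / (nxt - last).days

print(fractional_years(pd.Timestamp("2000-01-10"), pd.Timestamp("2004-01-10")))  # 4.0
print(fractional_years(pd.Timestamp("2000-02-29"), pd.Timestamp("2001-02-28")))  # 1.0
```

To fill a column, this could be applied row by row, for example df.apply(lambda r: fractional_years(r.date_start, r.date_end), axis=1).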