
python / pandas find number of years between two dates

I have a pandas dataframe with two columns that contain dates. I want to know the number of years between the two dates while accounting for leap years.

The ideal solution would not assume that a year always has a fixed number of days, because years don't always have 365 days.

Sample data:

date_end    date_start
2010-02-09  1933-03-03
2010-03-19  1924-04-08
2010-04-19  1924-04-08
2010-09-06  1924-04-08
2010-09-24  1924-04-08
2010-01-09  1933-04-29
2010-02-26  1933-04-29
2010-01-31  1953-06-10
2010-07-07  1928-11-14
2010-12-01  1974-11-17

date_start and date_end are of the "datetime" dtype. I want a new column that is the number of years between the two dates. It is simple to get the number of days between the two dates (df['diff'] = df.date_end - df.date_start) but then I run into trouble, since the number of years that passed for a given number of days depends on "when" the days happened, because of leap years.

This is analogous to a person's age. I've tried adapting a number of solutions to similar questions, but many questions are about number of days or weeks between two dates. I already have a way of getting the number of years without accounting for leap years, but I want to be more correct than that.

user139188 asked Mar 01 '17 00:03


2 Answers

Assuming you want to define a year as 365 days, then you can do this:

>> df
    date_end date_start  is_leapyear
0 2016-02-28 2015-02-28            0
1 2017-02-28 2016-02-28            1
2 2018-02-28 2017-02-28            0

>> from datetime import timedelta
>> df['diff_in_days'] = df['date_end'] - df['date_start']
>> df['diff_in_years'] = df["diff_in_days"] / timedelta(days=365)
>> print(df[["date_end", "date_start", "diff_in_years"]])

>> df
    date_end date_start  is_leapyear  diff_in_years
0 2016-02-28 2015-02-28            0        1.00000
1 2017-02-28 2016-02-28            1        1.00274
2 2018-02-28 2017-02-28            0        1.00000

As you can see, when the interval includes an extra day (Feb 29), more time has elapsed between the dates, so the result is slightly more than 1. In your case this would be:

    date_end date_start  diff_in_years
0 2010-02-09 1933-03-03      76.991781
1 2010-03-19 1924-04-08      86.002740
2 2010-04-19 1924-04-08      86.087671
3 2010-09-06 1924-04-08      86.471233
4 2010-09-24 1924-04-08      86.520548
5 2010-01-09 1933-04-29      76.750685
6 2010-02-26 1933-04-29      76.882192
7 2010-01-31 1953-06-10      56.682192
8 2010-07-07 1928-11-14      81.698630
9 2010-12-01 1974-11-17      36.063014
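
For reference, the 365-day calculation above can be reproduced as a small self-contained sketch. It uses three rows copied from the question's sample data, and pd.Timedelta stands in for datetime.timedelta (my substitution, it should behave the same here).

```python
import pandas as pd

# Three rows copied from the sample data in the question.
df = pd.DataFrame({
    "date_end": pd.to_datetime(["2010-02-09", "2010-03-19", "2010-12-01"]),
    "date_start": pd.to_datetime(["1933-03-03", "1924-04-08", "1974-11-17"]),
})

# Dividing a timedelta column by a fixed 365-day year yields a float column.
df["diff_in_years"] = (df["date_end"] - df["date_start"]) / pd.Timedelta(days=365)
print(df)
```

Because the divisor is fixed at 365 days, every Feb 29 inside the interval pushes the result slightly above the calendar-year count.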

On the other hand, if you just want the difference between the calendar years in which the dates fall (regardless of when in the year each date occurs), then you can do this:

df['date_end_year'] = df.date_end.dt.year
df['date_start_year'] = df.date_start.dt.year
df['diff_in_years'] = df['date_end_year'] - df['date_start_year']
print(df[["date_end", "date_start", "diff_in_years"]])

    date_end date_start  diff_in_years
0 2016-02-28 2015-02-28              1
1 2017-02-28 2016-02-28              1
2 2018-02-28 2017-02-28              1

In your case this would be:

    date_end date_start  diff_in_years
0 2010-02-09 1933-03-03             77
1 2010-03-19 1924-04-08             86
2 2010-04-19 1924-04-08             86
3 2010-09-06 1924-04-08             86
4 2010-09-24 1924-04-08             86
5 2010-01-09 1933-04-29             77
6 2010-02-26 1933-04-29             77
7 2010-01-31 1953-06-10             57
8 2010-07-07 1928-11-14             82
9 2010-12-01 1974-11-17             36
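
Plain year subtraction counts a year even when the anniversary has not yet been reached, which is why the first row gives 77 rather than 76. If what you want is closer to a person's age (completed years only), one possible adjustment is sketched below. This is an assumption about the intended meaning, not part of the approach above, and the column name full_years is made up for illustration.

```python
import pandas as pd

# Two rows copied from the sample data in the question.
df = pd.DataFrame({
    "date_end": pd.to_datetime(["2010-02-09", "2010-12-01"]),
    "date_start": pd.to_datetime(["1933-03-03", "1974-11-17"]),
})

year_diff = df["date_end"].dt.year - df["date_start"].dt.year

# True where the end date falls before the start date's month/day,
# i.e. the anniversary has not been reached yet in the end year.
before_anniversary = (
    (df["date_end"].dt.month < df["date_start"].dt.month)
    | (
        (df["date_end"].dt.month == df["date_start"].dt.month)
        & (df["date_end"].dt.day < df["date_start"].dt.day)
    )
)

# Completed years: subtract one where the anniversary is still ahead.
df["full_years"] = year_diff - before_anniversary.astype(int)
print(df)
```

Note that with this comparison a Feb 29 start date only completes a year on Mar 1 in non-leap years. Whether that is the right convention depends on your use case.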
dataflow answered Sep 19 '22 12:09


At first I tried praveen's answer, but found one thing unintuitive: if the start date falls in a leap year and the end date does not, the result is not an integer even when the month and day are the same in both dates, because the start date uses 366 days and the end date uses 365.

My idea is to count the full years from the start date up to the last anniversary on or before the end date. Then count the days between that anniversary and the end date, and divide by 365, or by 366 if either the anniversary or the end date falls in a leap year.

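import pandas as pd

def num_years(stdt: pd.Timestamp, endt: pd.Timestamp) -> float:
    stYr = stdt.year
    stMon = stdt.month
    stDay = stdt.day
    enYr = endt.year

    # Number of full years: place the start month/day in the end year and
    # step back one year if that anniversary falls after the end date.
    # Note: a Feb 29 start date raises ValueError here when the target year
    # is not a leap year.
    calcStdt = pd.Timestamp(year=enYr, month=stMon, day=stDay)
    if calcStdt > endt:
        calcStdt = pd.Timestamp(year=enYr - 1, month=stMon, day=stDay)
        rv = enYr - 1 - stYr
    else:
        rv = enYr - stYr

    # Remaining period of less than a year, as a fraction of 366 days if
    # either the last anniversary or the end date is in a leap year, else 365.
    if calcStdt.is_leap_year or endt.is_leap_year:
        rv = rv + (endt - calcStdt).days / 366
    else:
        rv = rv + (endt - calcStdt).days / 365
    return rv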

# Test
print(num_years(pd.Timestamp(year=2000, month=1, day=10), pd.Timestamp(year=2004, month=1, day=10)))
print(num_years(pd.Timestamp(year=2000, month=5, day=10), pd.Timestamp(year=2004, month=5, day=10)))
print(num_years(pd.Timestamp(year=2001, month=1, day=10), pd.Timestamp(year=2004, month=1, day=10)))
print(num_years(pd.Timestamp(year=2001, month=5, day=10), pd.Timestamp(year=2004, month=5, day=10)))
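
To apply this idea to the dataframe from the question, something like the sketch below might work. Since pd.Timestamp(year=..., month=..., day=...) raises for a Feb 29 start date in a non-leap year, this variant shifts the start date with pd.DateOffset instead, which moves Feb 29 to Feb 28 when needed. The name years_between is made up for illustration, and the Feb 28 convention is my assumption.

```python
import pandas as pd

def years_between(start: pd.Timestamp, end: pd.Timestamp) -> float:
    """Full years plus a fractional remainder (illustrative variant)."""
    # Shift the start date forward to its anniversary in the end year.
    # DateOffset maps Feb 29 to Feb 28 when the target year is not a leap year.
    full = end.year - start.year
    anniversary = start + pd.DateOffset(years=full)
    if anniversary > end:
        # Anniversary not reached yet in the end year: step back one year.
        full -= 1
        anniversary = start + pd.DateOffset(years=full)

    # Same divisor rule as above: 366 if either date is in a leap year.
    days_in_year = 366 if (anniversary.is_leap_year or end.is_leap_year) else 365
    return full + (end - anniversary).days / days_in_year

# Two rows copied from the sample data in the question.
df = pd.DataFrame({
    "date_end": pd.to_datetime(["2010-02-09", "2010-12-01"]),
    "date_start": pd.to_datetime(["1933-03-03", "1974-11-17"]),
})

# Row-wise application; plain Python iteration is fine for small frames.
df["diff_in_years"] = [
    years_between(s, e) for s, e in zip(df["date_start"], df["date_end"])
]
print(df)
```

This runs in a Python-level loop, so it will be slower than the vectorized approaches on large dataframes.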
Rich KS answered Sep 20 '22 12:09