Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date

I am facing the common task of calculating age (in years, months, or weeks) given a date of birth and an arbitrary date. I often have to do this over a very large number of records (more than 300 million), so performance is a key issue here.

After a quick search in SO and Google I found 3 alternatives:

  • A common arithmetic procedure (/365.25) (link)
  • Using functions new_interval() and duration() from package lubridate (link)
  • Function age_calc() from package eeptools (link, link, link)

So, here's my toy code:

    # Some toy birthdates
    birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01",
                           "1962-12-30", "1962-12-31", "1963-01-01",
                           "2000-06-16", "2000-06-17", "2000-06-18",
                           "2007-03-18", "2007-03-19", "2007-03-20",
                           "1968-02-29", "1968-02-29", "1968-02-29"))

    # Given dates to calculate the age
    givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31",
                           "2015-12-31", "2015-12-31", "2015-12-31",
                           "2050-06-17", "2050-06-17", "2050-06-17",
                           "2008-03-19", "2008-03-19", "2008-03-19",
                           "2015-02-28", "2015-03-01", "2015-03-02"))

    # Using a common arithmetic procedure ("Time differences in days"/365.25)
    (givendate - birthdate) / 365.25

    # Use the package lubridate
    require(lubridate)
    new_interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years")

    # Use the package eeptools
    library(eeptools)
    age_calc(dob = birthdate, enddate = givendate, units = "years")

Let's talk later about accuracy and focus first on performance. Here's the code:

    # Now let's compare the performance of the alternatives using microbenchmark
    library(microbenchmark)
    library(ggplot2)  # provides the autoplot() generic used below
    mbm <- microbenchmark(
        arithmetic = (givendate - birthdate) / 365.25,
        lubridate = new_interval(start = birthdate, end = givendate) /
            duration(num = 1, units = "years"),
        eeptools = age_calc(dob = birthdate, enddate = givendate,
                            units = "years"),
        times = 1000
    )

    # And examine the results
    mbm
    autoplot(mbm)

Here are the results:

[Image: Microbenchmark results - table]
[Image: Microbenchmark results - plot]

Bottom line: performance of lubridate and eeptools functions is much worse than the arithmetic method (/365.25 is at least 10 times faster). Unfortunately, the arithmetic method is not accurate enough and I cannot afford the few mistakes that this method will make.

"because of the way the modern Gregorian calendar is constructed, there is no straightforward arithmetic method that produces a person’s age, stated according to common usage — common usage meaning that a person’s age should always be an integer that increases exactly on a birthday". (link)

According to some posts I read, lubridate and eeptools make no such mistakes (though I haven't looked at the code or read more about those functions to know which method they use), and that's why I wanted to use them. However, they are too slow for my real application.

Any ideas on an efficient and accurate method to calculate the age?

EDIT

Oops, it seems lubridate also makes mistakes. Based on this toy example, it apparently makes more mistakes than the arithmetic method (see rows 3, 6, 9, and 12 of the output below). Am I doing something wrong?

    toy_df <- data.frame(
        birthdate = birthdate,
        givendate = givendate,
        arithmetic = as.numeric((givendate - birthdate) / 365.25),
        lubridate = new_interval(start = birthdate, end = givendate) /
            duration(num = 1, units = "years"),
        eeptools = age_calc(dob = birthdate, enddate = givendate,
                            units = "years")
    )
    toy_df[, 3:5] <- floor(toy_df[, 3:5])
    toy_df

        birthdate  givendate arithmetic lubridate eeptools
    1  1978-12-30 2015-12-31         37        37       37
    2  1978-12-31 2015-12-31         36        37       37
    3  1979-01-01 2015-12-31         36        37       36
    4  1962-12-30 2015-12-31         53        53       53
    5  1962-12-31 2015-12-31         52        53       53
    6  1963-01-01 2015-12-31         52        53       52
    7  2000-06-16 2050-06-17         50        50       50
    8  2000-06-17 2050-06-17         49        50       50
    9  2000-06-18 2050-06-17         49        50       49
    10 2007-03-18 2008-03-19          1         1        1
    11 2007-03-19 2008-03-19          1         1        1
    12 2007-03-20 2008-03-19          0         1        0
    13 1968-02-29 2015-02-28         46        47       46
    14 1968-02-29 2015-03-01         47        47       47
    15 1968-02-29 2015-03-02         47        47       47
Hernando Casas asked Jun 29 '15 22:06

1 Answer

The reason lubridate appears to be making mistakes above is that you are calculating durations (the exact amount of time that elapses between two instants, where 1 year = 31536000 s, i.e. 365 days), rather than periods (the change in clock time that occurs between two instants).
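To see where the difference comes from, row 3 of the toy data can be worked through by hand in base R. This is only an illustrative sketch: the fixed 365-day year is taken from the duration definition quoted above, and the variable names are made up for the example.

```r
# Row 3 of the toy data: born 1979-01-01, age wanted on 2015-12-31
birth <- as.Date("1979-01-01")
given <- as.Date("2015-12-31")

# Exact number of elapsed days (includes 9 leap days)
days <- as.numeric(given - birth)   # 13513

# Duration-style year of exactly 365 days: the leap days push the
# result past 37, one day before the 37th birthday
years_365 <- days / 365             # about 37.02

# The /365.25 approximation happens to stay below 37 for this pair
years_36525 <- days / 365.25        # about 36.997

floor(c(years_365, years_36525))
```

By the calendar, the person is still 36 on 2015-12-31, which is what the period-based calculation reports.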

To get the change in clock time (in years, months, days, etc) you need to use

    as.period(interval(start = birthdate, end = givendate))

which gives the following output

    "37y 0m 1d 0H 0M 0S"     "37y 0m 0d 0H 0M 0S"     "36y 11m 30d 0H 0M 0S"
    ...
    "46y 11m 30d 1H 0M 0S"   "47y 0m 0d 1H 0M 0S"     "47y 0m 1d 1H 0M 0S"

To just extract years, you can use the following

    as.period(interval(start = birthdate, end = givendate))$year
    [1] 37 37 36 53 53 52 50 50 49  1  1  0 46 47 47

Note that, sadly, this appears to be even slower than the methods above!

    > mbm
    Unit: microseconds
           expr       min        lq       mean    median         uq        max neval cld
     arithmetic   116.595   138.149   181.7547   184.335   196.8565   5556.306  1000  a
      lubridate 16807.683 17406.255 20388.1410 18053.274 21378.8875 157965.935  1000   b
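The calendar rule behind the period-based result (age goes up by one on the birthday) can also be written as plain vector arithmetic on date components. The following is a minimal base-R sketch under that assumption. `age_years` is a hypothetical helper, not part of lubridate or eeptools, and it has not been benchmarked here. It also assumes that someone born on 29 February turns a year older on 1 March in non-leap years, which matches the toy output above.

```r
# Hypothetical helper (illustrative only): completed years between two
# Date vectors, counted by calendar rather than by elapsed days
age_years <- function(birthdate, givendate) {
  b <- as.POSIXlt(birthdate)
  g <- as.POSIXlt(givendate)

  # Difference in calendar years
  age <- g$year - b$year

  # TRUE where the birthday has not yet been reached in the given year
  not_reached <- (g$mon < b$mon) | (g$mon == b$mon & g$mday < b$mday)

  # Subtract one year in those cases
  age - as.integer(not_reached)
}

# Rows 3, 13 and 14 of the toy data
ages <- age_years(
  as.Date(c("1979-01-01", "1968-02-29", "1968-02-29")),
  as.Date(c("2015-12-31", "2015-02-28", "2015-03-01"))
)
ages
# [1] 36 46 47
```

Because the function only does comparisons and subtraction on whole vectors, it avoids per-element date handling, but whether it is fast enough for hundreds of millions of records would need to be measured on real data.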
JWilliman answered Oct 31 '22 07:10