I am facing the common task of calculating age (in years, months, or weeks) given a date of birth and an arbitrary date. Quite often I have to do this over a very large number of records (>300 million), so performance is a key issue here.
After a quick search on SO and Google I found 3 alternatives: a common arithmetic procedure (time difference in days / 365.25); new_interval() and duration() from package lubridate (link); and age_calc() from package eeptools (link, link, link).
So, here's my toy code:
    # Some toy birthdates
    birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01",
                           "1962-12-30", "1962-12-31", "1963-01-01",
                           "2000-06-16", "2000-06-17", "2000-06-18",
                           "2007-03-18", "2007-03-19", "2007-03-20",
                           "1968-02-29", "1968-02-29", "1968-02-29"))

    # Given dates to calculate the age
    givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31",
                           "2015-12-31", "2015-12-31", "2015-12-31",
                           "2050-06-17", "2050-06-17", "2050-06-17",
                           "2008-03-19", "2008-03-19", "2008-03-19",
                           "2015-02-28", "2015-03-01", "2015-03-02"))

    # Using a common arithmetic procedure ("Time differences in days"/365.25)
    (givendate - birthdate) / 365.25

    # Use the package lubridate
    # (new_interval() was deprecated in later lubridate releases in favour of interval())
    library(lubridate)
    new_interval(start = birthdate, end = givendate) / duration(num = 1, units = "years")

    # Use the package eeptools
    library(eeptools)
    age_calc(dob = birthdate, enddate = givendate, units = "years")
Let's talk later about accuracy and focus first on performance. Here's the code:
    # Now let's compare the performance of the alternatives using microbenchmark
    library(microbenchmark)
    library(ggplot2)  # provides the autoplot() generic used below

    mbm <- microbenchmark(
      arithmetic = (givendate - birthdate) / 365.25,
      lubridate  = new_interval(start = birthdate, end = givendate) /
                     duration(num = 1, units = "years"),
      eeptools   = age_calc(dob = birthdate, enddate = givendate, units = "years"),
      times = 1000
    )

    # And examine the results
    mbm
    autoplot(mbm)
Here are the results:
Bottom line: the lubridate and eeptools functions perform much worse than the arithmetic method (/365.25 is at least 10 times faster). Unfortunately, the arithmetic method is not accurate enough, and I cannot afford the few mistakes that it will make.
"because of the way the modern Gregorian calendar is constructed, there is no straightforward arithmetic method that produces a person’s age, stated according to common usage — common usage meaning that a person’s age should always be an integer that increases exactly on a birthday". (link)
As I read in some posts, lubridate and eeptools make no such mistakes (though I haven't looked at the code or read more about those functions to know which method they use), and that's why I wanted to use them, but their performance is not good enough for my real application.
Any ideas on an efficient and accurate method to calculate the age?
Oops, it seems lubridate also makes mistakes. And apparently, based on this toy example, it makes more mistakes than the arithmetic method (see rows 3, 6, 9, 12). (Am I doing something wrong?)
    toy_df <- data.frame(
      birthdate  = birthdate,
      givendate  = givendate,
      arithmetic = as.numeric((givendate - birthdate) / 365.25),
      lubridate  = new_interval(start = birthdate, end = givendate) /
                     duration(num = 1, units = "years"),
      eeptools   = age_calc(dob = birthdate, enddate = givendate, units = "years")
    )
    toy_df[, 3:5] <- floor(toy_df[, 3:5])
    toy_df

        birthdate  givendate arithmetic lubridate eeptools
    1  1978-12-30 2015-12-31         37        37       37
    2  1978-12-31 2015-12-31         36        37       37
    3  1979-01-01 2015-12-31         36        37       36
    4  1962-12-30 2015-12-31         53        53       53
    5  1962-12-31 2015-12-31         52        53       53
    6  1963-01-01 2015-12-31         52        53       52
    7  2000-06-16 2050-06-17         50        50       50
    8  2000-06-17 2050-06-17         49        50       50
    9  2000-06-18 2050-06-17         49        50       49
    10 2007-03-18 2008-03-19          1         1        1
    11 2007-03-19 2008-03-19          1         1        1
    12 2007-03-20 2008-03-19          0         1        0
    13 1968-02-29 2015-02-28         46        47       46
    14 1968-02-29 2015-03-01         47        47       47
    15 1968-02-29 2015-03-02         47        47       47
How is the age calculated? Age is calculated by counting the number of years, months and days completed since birth. Leap years and months with 31 days are all factored into the calculation, so you can expect an accurate age down to the number of days.
Age is extracted from the Date_of_birth column with difftime() in a roundabout way: take the number of weeks between the date of birth and the current date and divide it by 52.25.
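The weeks-based approach described above can be sketched as follows. This is a minimal illustration using one pair of dates from the toy data in the question; the variable names are only for illustration. Note that 52.25 weeks is 365.75 days, so this is an approximation and can be off by one around a birthday.

```r
# Approximate age in years from the number of elapsed weeks.
# Dates are taken from row 2 of the toy data; names are illustrative.
birthdate <- as.Date("1978-12-31")
givendate <- as.Date("2015-12-31")

# difftime() on Date objects returns the exact day difference, here in weeks
weeks_elapsed <- as.numeric(difftime(givendate, birthdate, units = "weeks"))

# Divide by the approximate number of weeks per year
age_years <- weeks_elapsed / 52.25

floor(age_years)  # 36, although the 37th birthday falls on givendate
```

The result shows the same kind of off-by-one error as the /365.25 method, so this approach is only suitable where an approximate age is acceptable.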
Age of a Person = Given date - Date of birth. Ron's date of birth = July 25, 1985. Given date = January 28, 2021. Since the 2021 birthday has not yet occurred, the last completed year is 2020, so years' difference = 2020 - 1985 = 35 years.
Ans: To estimate a person's age, all you need is that person's year of birth: subtract the birth year from the current year. This gives the age reached during that year, so it is exact only once the birthday has passed. Age = 2020 - 1966 = 54.
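To make the difference between plain year subtraction and completed years concrete, here is a small sketch using Ron's dates from above. The variable names are hypothetical, and the birthday check is one simple way to do it, not the method of any particular package.

```r
birth <- as.Date("1985-07-25")
given <- as.Date("2021-01-28")

# Plain year subtraction ignores whether the birthday has occurred yet
year_diff <- as.integer(format(given, "%Y")) - as.integer(format(birth, "%Y"))

# Encode month and day as one integer (e.g. 128 for Jan 28, 725 for Jul 25)
# and check whether the birthday is still ahead in the given year
birthday_pending <- as.integer(format(given, "%m%d")) <
  as.integer(format(birth, "%m%d"))

# Completed years: drop one year if the birthday has not been reached
age <- year_diff - as.integer(birthday_pending)

year_diff  # 36
age        # 35
```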
The reason lubridate appears to be making mistakes above is that you are calculating duration (the exact amount of time that occurs between two instants, where 1 year = 31536000s), rather than periods (the change in clock time that occurs between two instants).
To get the change in clock time (in years, months, days, etc.) you need to use

    as.period(interval(start = birthdate, end = givendate))

which gives the following output

    "37y 0m 1d 0H 0M 0S"   "37y 0m 0d 0H 0M 0S"   "36y 11m 30d 0H 0M 0S"
    ...
    "46y 11m 30d 1H 0M 0S" "47y 0m 0d 1H 0M 0S"   "47y 0m 1d 1H 0M 0S"

To just extract years, you can use the following

    as.period(interval(start = birthdate, end = givendate))$year
    [1] 37 37 36 53 53 52 50 50 49  1  1  0 46 47 47
Note that this sadly appears to be even slower than the methods above!
    > mbm
    Unit: microseconds
           expr       min        lq       mean    median         uq        max neval cld
     arithmetic   116.595   138.149   181.7547   184.335   196.8565   5556.306  1000  a
      lubridate 16807.683 17406.255 20388.1410 18053.274 21378.8875 157965.935  1000   b
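Since the period-based approach is accurate but slow, one possible alternative is to compare calendar fields directly in base R. The sketch below is not from the original answer: age_years() is a hypothetical helper, and it assumes both inputs are Date vectors of equal length and that a 29 February birthday counts as reached on 1 March in non-leap years. It uses only a few vectorized integer operations, so it is likely to be much cheaper than building periods, but it has not been benchmarked here.

```r
# Hypothetical helper: completed years between two Date vectors (base R only)
age_years <- function(dob, enddate) {
  dob <- as.POSIXlt(dob)
  end <- as.POSIXlt(enddate)

  # Difference in calendar years
  age <- end$year - dob$year

  # TRUE where the birthday has not yet been reached in the end year
  not_reached <- (end$mon < dob$mon) |
    (end$mon == dob$mon & end$mday < dob$mday)

  age - as.integer(not_reached)
}

# A few rows from the toy data in the question
dob <- as.Date(c("1979-01-01", "2007-03-20", "1968-02-29", "1968-02-29"))
end <- as.Date(c("2015-12-31", "2008-03-19", "2015-02-28", "2015-03-01"))

ages <- age_years(dob, end)
ages  # 36 0 46 47
```

On these rows the result agrees with the $year output of as.period() shown above. It would be worth running it through microbenchmark on real data before relying on it.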