Format for ordinal dates (day of month with suffixes -st, -nd, -rd, -th)

Tags: date, r

Am I missing something? I can't figure out how to convert the following to Dates, where day of the month (%d) has the ordinal suffixes -st, -nd, -rd, -th:

ord_dates <- c("September 1st, 2016", "September 2nd, 2016",
               "September 3rd, 2016", "September 4th, 2016")

?strptime doesn't appear to list a shorthand for the ordinal suffix, and it isn't handled automagically:

as.Date(ord_dates, format = c("%B %d, %Y"))
#[1] NA NA NA NA

Is there a token for handling ignored characters in the format argument? A token I'm missing?

The best I can come up with is this (there may be a shorter regex, but it's the same idea):

as.Date(gsub("([0-9]+)(st|nd|rd|th)", "\\1", ord_dates), format = "%B %d, %Y")
# [1] "2016-09-01" "2016-09-02" "2016-09-03" "2016-09-04"

Seems like this sort of data should be relatively common; am I missing something?
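
For reuse, the regex workaround above can be wrapped in a small helper. This is a minimal sketch, not an existing base R function: the name `parse_ordinal_date` is made up for illustration, and it assumes English month names (the `LC_TIME` locale) and suffixes that directly follow a digit, so the "st" in "August" is left alone.

```r
# Hypothetical helper (not part of base R): drop an ordinal suffix only when
# it directly follows a digit, then parse with an ordinary strptime format.
parse_ordinal_date <- function(x, format = "%B %d, %Y") {
  # perl = TRUE enables the lookbehind (?<=[0-9]); \\b ends the suffix at a
  # word boundary so only a complete "st", "nd", "rd" or "th" is removed.
  cleaned <- sub("(?<=[0-9])(st|nd|rd|th)\\b", "", x, perl = TRUE)
  as.Date(cleaned, format = format)
}

parse_ordinal_date(c("August 1st, 2016", "September 22nd, 2016"))
```

Strings that do not match the format after cleaning still come back as NA, just as with a plain `as.Date` call.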

MichaelChirico asked Aug 30 '16 21:08


1 Answer

Enjoy the power of lubridate:

library(lubridate)    
mdy(ord_dates)

[1] "2016-09-01" "2016-09-02" "2016-09-03" "2016-09-04"

Internally, lubridate doesn't have any special conversion specifications which enable this. Rather, lubridate first uses (by smart guessing) the format "%B %dst, %Y". This gets the first element of ord_dates.

It then checks for NAs and repeats its smart guessing on the remaining elements, settling on "%B %dnd, %Y" to get the second element. It continues in this way until there are no NAs left (which happens in this case after 4 iterations), or until its smart guessing fails to turn up a likely format candidate.
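
The retry loop described above can be imitated in base R. This is only an illustration of the idea, not lubridate's actual implementation: the function name `parse_ordinal_iteratively` is made up, the list of candidate formats is hard-coded rather than guessed, and English month names (the `LC_TIME` locale) are assumed.

```r
# Illustration only: try one format per ordinal suffix, filling in just the
# elements that are still NA after the previous attempts.
parse_ordinal_iteratively <- function(x) {
  suffixes <- c("st", "nd", "rd", "th")
  out <- rep(as.Date(NA), length(x))   # start with every element unparsed
  for (s in suffixes) {
    todo <- is.na(out)
    if (!any(todo)) break              # stop once nothing is left to parse
    fmt <- paste0("%B %d", s, ", %Y")  # e.g. "%B %dst, %Y"
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

parse_ordinal_iteratively(c("September 1st, 2016", "September 2nd, 2016",
                            "September 3rd, 2016", "September 4th, 2016"))
```

Each pass re-parses only the leftover elements, so the number of passes grows with the number of distinct suffixes in the input, which is consistent with the four iterations mentioned above.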

You can imagine this makes lubridate slower, and it does. In the benchmark below it takes roughly twice as long as the regex approach suggested by @alistaire:

set.seed(109123)
ord_dates <- sample(
  c("September 1st, 2016", "September 2nd, 2016",
    "September 3rd, 2016", "September 4th, 2016"),
  1e6, TRUE
  )

library(microbenchmark)

microbenchmark(times = 10L,
               lubridate = mdy(ord_dates),
               base = as.Date(sub("\\D+,", "", ord_dates),
                              format = "%B %e %Y"))
# Unit: seconds
#       expr      min       lq     mean   median       uq      max neval cld
#  lubridate 2.167957 2.219463 2.290950 2.252565 2.301725 2.587724    10   b
#       base 1.183970 1.224824 1.218642 1.227034 1.228324 1.229095    10  a 

The obvious advantage in lubridate's favor is its conciseness and flexibility.

thepule answered Nov 16 '22 04:11