A sample of my dataframe:
date
1 25 February 1987
2 20 August 1974
3 9 October 1984
4 18 August 1992
5 19 September 1995
6 16-Oct-63
7 30-Sep-65
8 22 Jan 2008
9 13-11-1961
10 18 August 1987
11 15-Sep-70
12 5 October 1994
13 5 December 1984
14 03/23/87
15 30 August 1988
16 26-10-1993
17 22 August 1989
18 13-Sep-97
I have a large dataframe with a date variable that contains dates in multiple formats. Most of the formats are shown above; there are a couple of very rare others too. The formats differ because the data were pulled together from various websites, each of which used its own format.
I have tried using straightforward conversions e.g.
strftime(mydf$date,"%d/%m/%Y")
but conversions of this sort will not work if there are multiple formats. I don't want to resort to multiple gsub-type edits. Am I missing a simpler solution?
Code for example:
structure(list(date = structure(c(12L, 8L, 18L, 6L, 7L, 4L, 14L,
10L, 1L, 5L, 3L, 17L, 16L, 11L, 15L, 13L, 9L, 2L), .Label = c("13-11-1961",
"13-Sep-97", "15-Sep-70", "16-Oct-63", "18 August 1987", "18 August 1992",
"19 September 1995", "20 August 1974", "22 August 1989", "22 Jan 2008",
"03/23/87", "25 February 1987", "26-10-1993", "30-Sep-65", "30 August 1988",
"5 December 1984", "5 October 1994", "9 October 1984"), class = "factor")), .Names = "date", row.names = c(NA,
-18L), class = "data.frame")
You may try parse_date_time in package lubridate, which "allows the user to specify several format-orders to handle heterogeneous date-time character representations" using the orders argument. Something like...
library(lubridate)
parse_date_time(x = df$date,
orders = c("d m y", "d B Y", "m/d/y"),
locale = "eng")
...should be able to handle most of your formats. Please note that the b/B formats are locale sensitive.
Other date-time formats which can be used in orders are listed in the Details section in ?strptime.
Here is a base solution:
fmts <- c("%d-%b-%y", "%d %b %Y", "%d-%m-%Y", "%m/%d/%y")
d <- as.Date(as.numeric(apply(outer(DF$date, fmts, as.Date), 1, na.omit)), "1970-01-01")
We have made the simplifying assumption that exactly one format works for each input date. That seems to be the case in the example, but if not, replace na.omit with function(x) c(na.omit(x), NA)[1] so that the first matching format is used and dates matching no format become NA.
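The first-match variant described above can be sketched as follows. This is a minimal illustration, not the answer's exact code: the input vector x is hypothetical (its last element deliberately matches no format), and an English locale is assumed so that %b recognises "Oct" and "Jan".

```r
# Hypothetical input; the last string matches none of the formats.
x <- c("16-Oct-63", "22 Jan 2008", "13-11-1961", "03/23/87", "not a date")
fmts <- c("%d-%b-%y", "%d %b %Y", "%d-%m-%Y", "%m/%d/%y")

# One column per format, holding days since 1970-01-01, or NA on failure.
# Month names (%b) are matched in the current locale; English is assumed.
m <- sapply(fmts, function(f) as.numeric(as.Date(x, format = f)))

# For each row keep the first format that parsed, or NA if none did.
first_ok <- apply(m, 1, function(r) c(na.omit(r), NA)[1])
d <- as.Date(first_ok, origin = "1970-01-01")
```

Note that "16-Oct-63" comes out as 2063-10-16 at this stage, because %y maps 00 to 68 onto 20xx; the century still has to be corrected in a separate step.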
Note that a two-digit year can be ambiguous, but here it seems the dates should always be in the past, so we subtract 100 years from any date that falls in the future:
past <- function(x) ifelse(x > Sys.Date(), seq(from=x, length=2, by="-100 year")[2], x)
as.Date(sapply(d, past), "1970-01-01")
For the sample data the last line gives:
[1] "1987-02-25" "1974-08-20" "1984-10-09" "1992-08-18" "1995-09-19"
[6] "1963-10-16" "1965-09-30" "2008-01-22" "1961-11-13" "1987-08-18"
[11] "1970-09-15" "1994-10-05" "1984-12-05" "1987-03-23" "1988-08-30"
[16] "1993-10-26" "1989-08-22" "1997-09-13"
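As an alternative sketch of the same century correction, the year can be shifted through POSIXlt, which is vectorised and avoids sapply. The name past_vec and its today argument are hypothetical and not part of the answer above; today is exposed only so the result does not depend on when the code is run.

```r
# Hypothetical helper: move any date later than `today` back by 100 years.
past_vec <- function(d, today = Sys.Date()) {
  lt <- as.POSIXlt(d)                   # broken-down time with a $year field
  future <- !is.na(d) & d > today       # NA dates are left untouched
  lt$year[future] <- lt$year[future] - 100
  as.Date(lt)
}

fixed <- past_vec(as.Date(c("2063-10-16", "1987-03-23", NA)),
                  today = as.Date("2024-01-01"))
```

Here only the first element lies after the reference date, so it becomes 1963-10-16 while the other two are returned unchanged.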