How to convert variable with mixed date formats to one format?

A sample of my dataframe:

                date
1   25 February 1987
2     20 August 1974
3     9 October 1984
4     18 August 1992
5  19 September 1995
6          16-Oct-63
7          30-Sep-65
8        22 Jan 2008
9         13-11-1961
10    18 August 1987
11         15-Sep-70
12    5 October 1994
13   5 December 1984
14          03/23/87
15    30 August 1988
16        26-10-1993
17    22 August 1989
18         13-Sep-97

I have a large dataframe with a date variable that has multiple formats for dates. Most of the formats in the variable are shown above; there are a couple of very rare others too. The formats differ because the data were pulled together from various websites that each used a different format.

I have tried straightforward conversions, e.g.

strftime(mydf$date,"%d/%m/%Y")

but this sort of conversion will not work when there are multiple formats. I don't want to resort to multiple gsub-type edits. Am I missing a simpler solution?

Code for example:

mydf <- structure(list(date = structure(c(12L, 8L, 18L, 6L, 7L, 4L, 14L,
10L, 1L, 5L, 3L, 17L, 16L, 11L, 15L, 13L, 9L, 2L), .Label = c("13-11-1961",
"13-Sep-97", "15-Sep-70", "16-Oct-63", "18 August 1987", "18 August 1992",
"19 September 1995", "20 August 1974", "22 August 1989", "22 Jan 2008",
"03/23/87", "25 February 1987", "26-10-1993", "30-Sep-65", "30 August 1988",
"5 December 1984", "5 October 1994", "9 October 1984"), class = "factor")), .Names = "date", row.names = c(NA,
-18L), class = "data.frame")
asked Aug 23 '14 15:08 by jalapic


2 Answers

You could try parse_date_time from the lubridate package, which "allows the user to specify several format-orders to handle heterogeneous date-time character representations" through its orders argument. Something like...

library(lubridate)
parse_date_time(x = df$date,
                orders = c("d m y", "d B Y", "m/d/y"),
                locale = "eng")

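...should be able to handle most of your formats. Please note that the b/B formats are locale sensitive, and that the locale name itself is platform-specific, so "eng" may need to be replaced (for example with "en_US.UTF-8" on many Linux systems).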

Other date-time formats which can be used in orders are listed in the Details section in ?strptime.
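To illustrate the locale sensitivity of month names without depending on a platform-specific locale name, here is a minimal base R sketch. It is not part of lubridate: with_c_time_locale is a hypothetical helper introduced only for this example, and it assumes that the "C" time locale (which uses English month names) is acceptable for the data.

```r
# Hypothetical helper: evaluate an expression with LC_TIME set to "C",
# then restore the previous time locale, even if an error occurs.
with_c_time_locale <- function(expr) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", "C")
  force(expr)  # expr is a lazy promise, so it is evaluated under "C" here
}

# English month names parse regardless of the session's own locale
with_c_time_locale(as.Date("25 February 1987", format = "%d %B %Y"))
```

Restoring the locale in on.exit keeps the change local to the one call, so the rest of the session is unaffected.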

answered Nov 08 '22 10:11 by Henrik


Here is a base solution:

fmts <- c("%d-%b-%y", "%d %b %Y", "%d-%m-%Y", "%m/%d/%y")
d <- as.Date(as.numeric(apply(outer(DF$date, fmts, as.Date), 1, na.omit)), "1970-01-01")

We have made the simplifying assumption that exactly 1 format works for each input date. That seems to be the case in the example, but if not, replace na.omit with function(x) c(na.omit(x), NA)[1].
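When several formats might match the same string, or none does, the same idea can be written as a loop that keeps the first format that parses. This is only a minimal sketch: parse_mixed is a hypothetical helper name, the sample vector is a subset of the question's data plus one deliberately invalid string, and an English or C time locale is assumed so that %b matches the month names.

```r
# Hypothetical helper (not part of base R): try each format in order and
# keep, for every element, the first format that yields a valid date.
parse_mixed <- function(x, fmts) {
  x <- as.character(x)                 # factors must be converted first
  out <- rep(as.Date(NA), length(x))   # start with all-NA dates
  for (f in fmts) {
    todo <- is.na(out)                 # elements not parsed so far
    out[todo] <- as.Date(x[todo], format = f)
  }
  out                                  # NA where no format matched
}

fmts <- c("%d-%b-%y", "%d %b %Y", "%d-%m-%Y", "%m/%d/%y")
x <- c("25 February 1987", "22 Jan 2008", "13-11-1961", "03/23/87", "not a date")
parse_mixed(x, fmts)
```

Because earlier formats win, the order of fmts matters whenever a string is valid under more than one of them. Elements that remain NA are the ones no format could parse, which makes the rare extra formats easy to find.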

Note that a two-digit year can be ambiguous, but here it seems every date should be in the past, so we subtract 100 years from any date that falls in the future:

 past <- function(x) ifelse(x > Sys.Date(), seq(from=x, length=2, by="-100 year")[2], x)
 as.Date(sapply(d, past), "1970-01-01")

For the sample data the last line gives:

 [1] "1987-02-25" "1974-08-20" "1984-10-09" "1992-08-18" "1995-09-19"
 [6] "1963-10-16" "1965-09-30" "2008-01-22" "1961-11-13" "1987-08-18"
[11] "1970-09-15" "1994-10-05" "1984-12-05" "1987-03-23" "1988-08-30"
[16] "1993-10-26" "1989-08-22" "1997-09-13"
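The century correction can also be applied to the whole vector at once by editing the year field of a POSIXlt object. Again a minimal sketch rather than a definitive implementation: shift_future and its today argument are hypothetical names introduced here, and it relies on the same assumption that no genuine date lies in the future.

```r
# Hypothetical helper: move any date later than `today` back by 100 years.
shift_future <- function(d, today = Sys.Date()) {
  lt <- as.POSIXlt(d)                       # exposes the year component
  future <- !is.na(d) & d > today           # NA dates are left untouched
  lt$year[future] <- lt$year[future] - 100  # year is stored as years since 1900
  as.Date(lt)
}

d <- as.Date(c("2063-10-16", "1987-02-25", NA))
shift_future(d)
```

Passing today explicitly makes the cutoff reproducible, for instance when the data are known to end at a particular date.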
answered Nov 08 '22 12:11 by G. Grothendieck