I have a large data set with many columns containing dates in two different formats:
"1996-01-04" "1996-01-05" "1996-01-08" "1996-01-09" "1996-01-10" "1996-01-11"
and
"02/01/1996" "03/01/1996" "04/01/1996" "05/01/1996" "08/01/1996" "09/01/1996"
In both cases, the class() is "character". Since the data set has many rows (4.5 million), I am looking for an efficient data.table conversion method. Right now, I use this self-built function:
convert_to_date <- function(in_array){
  tmp <- try(as.Date(in_array, format = "%d/%m/%Y"),TRUE)
  if (all(!is.na(tmp)) & class(tmp) != "try-error"){
    return(tmp)
  } else{
    tmp2 <- try(as.Date(in_array),TRUE)
    if (all(!is.na(tmp2)) & class(tmp2) != "try-error"){
      return(tmp2)
    } else{
      return(in_array)
    }
  }
}
I then convert the columns I need (of data.table DF) with
DF[,date:=convert_to_date(date)]
This is, however, still incredibly slow (nearly 45s per column).
Is there any way of optimising this via data.table methods? So far I have not found a better way, so I would be thankful for any tips.
P.S: For better readability, I have 'outsourced' the function to a second file and sourced it in my main routine. Does that have a (negative) significant impact on computation speed in R?
According to this benchmark, the fastest method to convert character dates in standard unambiguous format (YYYY-MM-DD) into class Date is to use as.Date(fasttime::fastPOSIXct()).
Unfortunately, this requires testing the format beforehand because your other format, DD/MM/YYYY, is misinterpreted by fasttime::fastPOSIXct().
So, if you don't want to bother about the format of each date column, you may use the anytime::anydate() function:
# sample data
df <- data.frame(
  X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"),
  X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"),
  stringsAsFactors = FALSE)
library(data.table)
# convert date columns
date_cols <- c("X1", "X2")
setDT(df)[, (date_cols) := lapply(.SD, anytime::anydate), .SDcols = date_cols]
df
           X1         X2
1: 1996-01-04 1996-02-01
2: 1996-01-05 1996-03-01
3: 1996-01-08 1996-04-01
4: 1996-01-09 1996-05-01
5: 1996-01-10 1996-08-01
6: 1996-01-11 1996-09-01
The benchmark timings show that there is a trade-off between the convenience offered by the anytime package and performance. So if speed is crucial, there is no other way than to test the format of each column and to use the fastest conversion method available for that format.
The OP has used the try() function for this purpose. The solution below uses regular expressions to find all columns which match a given format (only row 1 is used to save time). This has the additional benefit that the names of the relevant columns are determined automatically and need not be typed in.
# enhanced sample data with additional columns
df <- data.frame(
  X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"),
  X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"),
  X3 = "other data",
  X4 = 1:6,
  stringsAsFactors = FALSE)
library(data.table)
options(datatable.print.class = TRUE)
# coerce to data.table
setDT(df)[]
# convert date columns in standard unambiguous format YYYY-MM-DD
date_cols1 <- na.omit(names(df)[
  df[1, sapply(.SD, stringr::str_detect, pattern = "\\d{4}-\\d{2}-\\d{2}"),]])
# use fasttime package
df[, (date_cols1) := lapply(.SD, function(x) as.Date(fasttime::fastPOSIXct(x))),
   .SDcols = date_cols1]
# convert date columns in DD/MM/YYYY format
date_cols2 <- na.omit(names(df)[
  df[1, sapply(.SD, stringr::str_detect, pattern = "\\d{2}/\\d{2}/\\d{4}"),]])
# use lubridate package
df[, (date_cols2) := lapply(.SD, lubridate::dmy), .SDcols = date_cols2]
df
           X1         X2         X3    X4
       <Date>     <Date>     <char> <int>
1: 1996-01-04 1996-01-02 other data     1
2: 1996-01-05 1996-01-03 other data     2
3: 1996-01-08 1996-01-04 other data     3
4: 1996-01-09 1996-01-05 other data     4
5: 1996-01-10 1996-01-08 other data     5
6: 1996-01-11 1996-01-09 other data     6
If one of the date columns contains NA in the first row, that column will escape conversion. To handle such cases, the above code needs to be amended.
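One possible amendment, sketched here in base R, is to test the first non-NA value of each column instead of row 1. The helper name detect_cols and the anchored patterns are illustrative assumptions, not part of the answer above; the helper works on a data.frame or a data.table because both are lists of columns.

```r
# Hypothetical helper (not from the original answer): return the names of
# character columns whose first non-NA value matches `pattern`.
detect_cols <- function(df, pattern) {
  is_match <- vapply(df, function(col) {
    if (!is.character(col)) return(FALSE)
    first_val <- col[!is.na(col)][1L]  # NA if the column is entirely NA
    !is.na(first_val) && grepl(pattern, first_val)
  }, logical(1L))
  names(df)[is_match]
}

# Sample data with NA in the first row of X1
df <- data.frame(
  X1 = c(NA, "1996-01-05", "1996-01-08"),
  X2 = c("02/01/1996", NA, "04/01/1996"),
  X3 = "other data",
  X4 = 1:3,
  stringsAsFactors = FALSE
)

iso_cols <- detect_cols(df, "^[0-9]{4}-[0-9]{2}-[0-9]{2}$")
dmy_cols <- detect_cols(df, "^[0-9]{2}/[0-9]{2}/[0-9]{4}$")
```

The resulting name vectors can then be passed to .SDcols in the same way as date_cols1 and date_cols2. Scanning for the first non-NA value costs a pass over each column, which should still be small compared with the conversion itself.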
df <- data.frame(X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"), X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"), stringsAsFactors = FALSE)
str(df)
'data.frame': 6 obs. of 2 variables:
 $ X1: chr "1996-01-04" "1996-01-05" "1996-01-08" "1996-01-09" ...
 $ X2: chr "02/01/1996" "03/01/1996" "04/01/1996" "05/01/1996" ...
library(dplyr)
library(lubridate)
# X2 is day-first (DD/MM/YYYY, as in the question), so parse it with dmy()
ans <- df %>%
  mutate(X1 = ymd(X1), X2 = dmy(X2))
ans
          X1         X2
1 1996-01-04 1996-01-02
2 1996-01-05 1996-01-03
3 1996-01-08 1996-01-04
4 1996-01-09 1996-01-05
5 1996-01-10 1996-01-08
6 1996-01-11 1996-01-09
str(ans)
'data.frame': 6 obs. of 2 variables:
 $ X1: Date, format: "1996-01-04" "1996-01-05" ...
 $ X2: Date, format: "1996-01-02" "1996-01-03" ...
Since you know beforehand there are only two date formats, this is easy. The format argument to as.Date is vectorized:
as_date_either <- function(x) {
  # default to ISO format, then switch to day-first for entries containing "/"
  format_vec <- rep_len("%Y-%m-%d", length(x))
  format_vec[grep("/", x, fixed = TRUE)] <- "%d/%m/%Y"
  as.Date(x, format = format_vec)
}
Edited: replaced ifelse with subset assignment, which is faster
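To illustrate the per-element format idea, here is a small self-contained sketch on a vector that mixes both formats and contains an NA. It assumes that slash-separated dates are day-first (DD/MM/YYYY), as stated in the question; the variable names are illustrative only.

```r
# Mixed-format input; assumption: dates with "/" are DD/MM/YYYY
x <- c("1996-01-04", "02/01/1996", NA, "1996-01-05")

# Start with the ISO format for every element, then overwrite the format
# for those elements that contain a slash (subset assignment)
fmt <- rep_len("%Y-%m-%d", length(x))
fmt[grep("/", x, fixed = TRUE)] <- "%d/%m/%Y"

# as.Date() applies each format to the corresponding element
res <- as.Date(x, format = fmt)
res
```

Because every element carries its own format, this also works when a single column mixes the two formats, and NA entries simply stay NA.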