I am working with .csv data that was exported from Teradata. Several columns were originally timestamps with timezones, so after loading the .csv in R I'd like to convert these columns (which are loaded as strings) to POSIXlt or POSIXct. I am using strptime
, but the format of the timezone from the .csv file does not match what strptime
is expecting. For example, it expects -0400
but the .csv has the format -04:00
where a colon separates the hours and minutes.
I can remove the colon, but this is an extra step and complication I'd like to avoid if possible. Is there a way to tell strptime
to use a different format for the timezone (%z
)?
Here is an example:
## Example data:
x <- c("2011-10-12 22:17:13.860746-04:00", "2011-10-12 22:17:13.860746+00:00")
format <- "%Y-%m-%d %H:%M:%OS%z"
## Doesn't work:
strptime(x,format)
## [1] NA NA
## Ignores the timezone:
as.POSIXct(x)
## [1] "2011-10-12 22:17:13 EDT" "2011-10-12 22:17:13 EDT"
## Remove the last colon:
x2 <- gsub("(.*):", "\\1", x)
x2
## [1] "2011-10-12 22:17:13.860746-0400" "2011-10-12 22:17:13.860746+0000"
## This works, but requires extra processing (removing the colon)
strptime(x2,format)
## [1] "2011-10-12 22:17:13" "2011-10-12 18:17:13"
So I'm looking to achieve this last result using something like strptime(x,"%Y-%m-%d %H:%M:%OS%zz")
, where %zz
is a custom expression for the timezone that recognizes the -04:00
format. Or %zH:%zM
might be even better.
If this isn't possible, does anyone have a slick/flexible function for converting strings (of various formats) to dates for multiple columns of a data.frame/data.table?
It turns out lubridate
can handle this format:
library(lubridate)
ymd_hms(x)
## [1] "2011-10-13 02:17:13 UTC" "2011-10-12 22:17:13 UTC"
Or, to display in the local timezone:
with_tz(ymd_hms(x))
## [1] "2011-10-12 22:17:13 EDT" "2011-10-12 18:17:13 EDT"
For more flexibility (still using lubridate
):
parse_date_time(x, "%Y-%m-%d %H:%M:%OS%z")
For faster speed (amongst lubridate
options):
lubridate:::.strptime(x, "%Y-%m-%d %H:%M:%OS%OO")
Timings:
microbenchmark(
ymd_hms(x),
parse_date_time(x, "%Y-%m-%d %H:%M:%OS%z"),
lubridate:::.strptime(x, "%Y-%m-%d %H:%M:%OS%OO"),
strptime(gsub("(.*):", "\\1", x), format)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## ymd_hms(x) 1523.819 1578.495 1715.14577 1629.5385 1744.3695 2850.393 100
## parse_date_time(x, "%Y-%m-%d %H:%M:%OS%z") 1108.676 1150.633 1273.77301 1190.3315 1264.8050 5947.204 100
## lubridate:::.strptime(x, "%Y-%m-%d %H:%M:%OS%OO") 89.838 103.390 112.45338 107.8425 115.2265 216.512 100
## strptime(gsub("(.*):", "\\\\1", x), format) 46.716 58.294 71.90934 69.9415 86.5860 105.044 100
I've just come across this question trying to achieve the same thing.
The only thing I've found to fix it is to use regex to remove the colon, as you have mentioned. You can tighten the regex a little to avoid making mistakes in the replacement.
x2 <- gsub('^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\\.[0-9]+[+-][0-9]{2}):([0-9]{2})$',
'\\1\\2',
x)
# [1] "2011-10-12 22:17:13.860746-0400" "2011-10-12 22:17:13.860746+0000"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With