I have a POSIXct
vector that slightly misuses that format:
> head(df$datetime)
[1] "2016-03-03 12:30:00 UTC" "2016-03-03 12:00:00 UTC" "2016-02-27 09:00:00 UTC" "2016-03-03 17:30:00 UTC"
[5] "2016-03-03 10:30:00 UTC" "2016-03-03 14:30:00 UTC"
These datetimes are marked as being UTC times but are really in an assortment of local timezones:
> df %>% select(datetime, timezone) %>% head
             datetime            timezone
1 2016-03-03 12:30:00 Australia/Melbourne
2 2016-03-03 12:00:00       Europe/Berlin
3 2016-02-27 09:00:00    Europe/Amsterdam
4 2016-03-03 17:30:00  Australia/Brisbane
5 2016-03-03 10:30:00    Europe/Amsterdam
6 2016-03-03 14:30:00       Europe/Berlin
I would like to convert these datetimes to UTC proper – in some sense the inverse problem faced here and here – but am having a hard time. A variation of the solution from the second link works:
library(lubridate)  # with_tz(), force_tz()
library(dplyr)      # combine()

get_utc_time <- function(timestamp_local, local_tz) {
  # element by element: stamp each wall-clock time with its own timezone, then shift to UTC
  l <- lapply(seq_along(timestamp_local),
              function(x) with_tz(force_tz(timestamp_local[x], tzone = local_tz[x]), tzone = 'UTC'))
  as.POSIXct(combine(l), origin = '1970-01-01', tz = 'UTC')
}
df$datetime_utc <- get_utc_time(df$datetime, df$timezone)
(dplyr::mutate(df, datetime_utc = get_utc_time(datetime, timezone)), which I thought would be equivalent, throws an error.)
But since this isn't vectorized, it's terribly slow on a data.frame with half a million rows. Is there a more elegant and faster way to do this?
The most 'official' way I know involves formatting and reparsing; David Smith had a post on this a while ago on the REvolutions blog.
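For reference, the format-and-reparse idea looks roughly like this (a sketch of my own, handling one timezone at a time; the helper name reparse_utc is mine, not from that post):
# Sketch: print the wall-clock time as text, re-read it with its true timezone,
# then flip the display timezone to UTC (the stored instant is already correct)
reparse_utc <- function(ts, tz) {
  out <- as.POSIXct(format(ts, "%Y-%m-%d %H:%M:%S"), tz = tz)
  attr(out, "tzone") <- "UTC"
  out
}
Applied one timezone group at a time (e.g. via split()), this avoids the per-element lapply() in the question.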
Time series libraries, particularly those which are timezone-aware, can do it too. Here is an approach using RcppCCTZ, which is my wrapper around CCTZ (written by some Googlers, but not an official Google library) -- its tzDiff() helper computes the difference (by default in hours) between two timezones.
library(RcppCCTZ) # you need the GitHub version though
# your data
df <- read.csv(text="datetime,timezone
2016-03-03 12:30:00,Australia/Melbourne
2016-03-03 12:00:00,Europe/Berlin
2016-02-27 09:00:00,Europe/Amsterdam
2016-03-03 17:30:00,Australia/Brisbane
2016-03-03 10:30:00,Europe/Amsterdam
2016-03-03 14:30:00,Europe/Berlin", stringsAsFactors=FALSE)
# parse to POSIXct
df[,"pt"] <- as.POSIXct(df[,"datetime"])

# compute the difference to UTC (in hours), row by row
for (i in 1:6)
    df[i,"diff"] <- tzDiff("UTC", df[i,"timezone"], df[i,"pt"])
This gets us this data.frame:
R> df
             datetime            timezone                  pt diff
1 2016-03-03 12:30:00 Australia/Melbourne 2016-03-03 12:30:00   11
2 2016-03-03 12:00:00       Europe/Berlin 2016-03-03 12:00:00    1
3 2016-02-27 09:00:00    Europe/Amsterdam 2016-02-27 09:00:00    1
4 2016-03-03 17:30:00  Australia/Brisbane 2016-03-03 17:30:00   10
5 2016-03-03 10:30:00    Europe/Amsterdam 2016-03-03 10:30:00    1
6 2016-03-03 14:30:00       Europe/Berlin 2016-03-03 14:30:00    1
R>
It would be simple to return the parsed Datetime offset as well, but the little helper function tzDiff is not currently doing this. I could add that as a second helper function if you want to go this route...
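In the meantime, here is a rough sketch of my own (not part of the package) for turning the diff column into proper UTC; it assumes, as the table above shows, that diff is the number of hours the local zone is ahead of UTC, so we reinterpret the wall-clock text as UTC and subtract that many hours.
# Sketch only: wall-clock strings read as if they were UTC, minus the hour offset from tzDiff()
wall_as_utc <- as.POSIXct(df[,"datetime"], tz="UTC")
df[,"datetime_utc"] <- wall_as_utc - df[,"diff"] * 3600
# e.g. row 1: 12:30 in Melbourne (UTC+11) becomes 2016-03-03 01:30:00 UTC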
Edit: This is an interesting problem. I have by now added some code to RcppCCTZ to do this, but it is not (yet, at least) vectorized. That said, we can have an IMHO much simpler and faster solution using data.table.
Let's first encode your solution and the three packages it needs:
library(lubridate)
library(magrittr)
library(dplyr)
useLubridate <- function(df) {
    df %>%
        group_by(timezone) %>%
        mutate(datetime_local = ymd_hms(datetime, tz = unique(timezone))) %>%
        mutate(datetime_utc = with_tz(datetime_local, tzone = 'UTC')) %>%
        ungroup %>%
        select(datetime_utc) -> df
    df
}
Then let's do the same for data.table:
library(data.table)
useDataTable <- function(df) {
    dt <- as.data.table(df)
    # parse once per timezone group; the resulting instants are correct
    dt[, pt := as.POSIXct(datetime, tz = timezone[1]), by = timezone]
    dt[]
}
Note that this returns three columns rather than just one.
And while we're at it, let's do a horse race:
R> library(microbenchmark)
R> microbenchmark( useDataTable(df), useLubridate(df) )
Unit: milliseconds
             expr     min      lq    mean  median      uq      max neval cld
 useDataTable(df) 1.23148 1.53900 1.61174 1.57635 1.64734  3.85423   100  a 
 useLubridate(df) 7.51158 8.88734 9.10439 9.19390 9.38032 15.27572   100   b
R>
So data.table is faster while also returning more useful information. Collating the third column back into a data.frame (or the like) would take up some more time.
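To illustrate that collating step (my sketch, not part of the benchmark): the pt column already holds the correct instants, so only the display timezone needs switching.
# Sketch: pull the instants back into df and show them in UTC
res <- useDataTable(df)
df$datetime_utc <- res$pt
attr(df$datetime_utc, "tzone") <- "UTC"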