
Convert vector with local times to UTC

Tags:

time

r

I have a POSIXct vector that slightly misuses that format:

> head(df$datetime)
[1] "2016-03-03 12:30:00 UTC" "2016-03-03 12:00:00 UTC" "2016-02-27 09:00:00 UTC" "2016-03-03 17:30:00 UTC"
[5] "2016-03-03 10:30:00 UTC" "2016-03-03 14:30:00 UTC"

These datetimes are marked as being UTC times but are really in an assortment of local timezones:

> df %>% select(datetime, timezone) %>% head
         datetime            timezone
1 2016-03-03 12:30:00 Australia/Melbourne
2 2016-03-03 12:00:00 Europe/Berlin
3 2016-02-27 09:00:00 Europe/Amsterdam
4 2016-03-03 17:30:00 Australia/Brisbane
5 2016-03-03 10:30:00 Europe/Amsterdam
6 2016-03-03 14:30:00 Europe/Berlin

I would like to convert these datetimes to UTC proper – in some sense the inverse problem faced here and here – but am having a hard time. A variation of the solution from the second link works:

get_utc_time <- function(timestamp_local, local_tz) {
  l <- lapply(seq_along(timestamp_local),
              function(x) with_tz(force_tz(timestamp_local[x], tzone = local_tz[x]), tzone = 'UTC'))
  as.POSIXct(dplyr::combine(l), origin = '1970-01-01', tz = 'UTC')
}

df$datetime_utc <- get_utc_time(df$datetime, df$timezone)

(dplyr::mutate(df, datetime_utc = get_utc_time(datetime, timezone)), which I thought would be equivalent, throws an error.)

But since this isn't vectorized, it's terribly slow on a data.frame with half a million rows. Is there a more elegant and faster way to do this?

Asked Dec 15 '22 by RoyalTS


1 Answer

The most 'official' way I know involves formatting and reparsing; David Smith had a post on this a while ago on the REvolutions blog.

Time-series libraries, particularly those which are timezone-aware, can do it too. Here is an approach using RcppCCTZ, my wrapper around CCTZ (written by some Googlers, but not an official Google library) -- it computes the difference (by default in hours) between two timezones.

library(RcppCCTZ)  # you need the GitHub version though

# your data
df <- read.csv(text="datetime,timezone
2016-03-03 12:30:00,Australia/Melbourne
2016-03-03 12:00:00,Europe/Berlin
2016-02-27 09:00:00,Europe/Amsterdam
2016-03-03 17:30:00,Australia/Brisbane
2016-03-03 10:30:00,Europe/Amsterdam
2016-03-03 14:30:00,Europe/Berlin", stringsAsFactors=FALSE)

# parse to POSIXct
df[,"pt"] <- as.POSIXct(df[,"datetime"])

# compute difference row by row (tzDiff is not vectorized)
for (i in seq_len(nrow(df)))
    df[i, "diff"] <- tzDiff("UTC", df[i, "timezone"], df[i, "pt"])

This gets us this data.frame:

R> df
             datetime            timezone                  pt diff
1 2016-03-03 12:30:00 Australia/Melbourne 2016-03-03 12:30:00   11
2 2016-03-03 12:00:00       Europe/Berlin 2016-03-03 12:00:00    1
3 2016-02-27 09:00:00    Europe/Amsterdam 2016-02-27 09:00:00    1
4 2016-03-03 17:30:00  Australia/Brisbane 2016-03-03 17:30:00   10
5 2016-03-03 10:30:00    Europe/Amsterdam 2016-03-03 10:30:00    1
6 2016-03-03 14:30:00       Europe/Berlin 2016-03-03 14:30:00    1
R> 

It would be simple to return the parsed Datetime offset as well, but the little helper function tzDiff does not currently do this. I could add that as a second helper function if you want to go this route...
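Given the diff column, the remaining conversion can be done in one vectorized step. A minimal sketch (my addition, not part of the original answer, using a toy data.frame with the offsets precomputed as above): parse the clock times as if they were UTC, then subtract the offset in hours.

```r
# toy data mirroring the df above, with tzDiff() offsets precomputed
df <- data.frame(datetime = c("2016-03-03 12:30:00", "2016-03-03 12:00:00"),
                 diff     = c(11, 1))   # hours east of UTC at those instants

# reinterpret each clock time as UTC, then subtract the offset (in seconds)
df$datetime_utc <- as.POSIXct(df$datetime, tz = "UTC") - df$diff * 3600
df$datetime_utc
# [1] "2016-03-03 01:30:00 UTC" "2016-03-03 11:00:00 UTC"
```

This stays fully vectorized once the per-row offsets exist; the loop over tzDiff remains the expensive part.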

Edit: This is an interesting problem. I have by now added some code to RcppCCTZ to do this, but it is not (yet, at least) vectorized. That said, we can have an IMHO much simpler and faster solution using data.table.

Let's first encode your solution and the three packages it needs:

library(lubridate)
library(magrittr)
library(dplyr)
useLubridate <- function(df) {
    df %>%
        group_by(timezone) %>%
        mutate(datetime_local = ymd_hms(datetime, tz=unique(timezone))) %>%
        mutate(datetime_utc = with_tz(datetime_local, tzone = 'UTC')) %>% 
        ungroup %>%
        select(datetime_utc) -> df
    df
}

Then let's do the same for data.table:

library(data.table)
useDataTable <- function(df) {
    dt <- as.data.table(df)
    dt[, pt := as.POSIXct(datetime, tz=timezone[1]), by=timezone] 
    dt[]
}

Note that this returns three columns rather than just one.
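If you do want a single UTC column out of this, one option (my sketch, not part of the original answer) is to relabel the combined pt column: the by-group as.POSIXct already produced the correct instants, and the column carries only one display-timezone attribute, so with_tz merely changes how they print.

```r
library(data.table)
library(lubridate)

df <- data.frame(datetime = c("2016-03-03 12:30:00", "2016-03-03 12:00:00"),
                 timezone = c("Australia/Melbourne", "Europe/Berlin"),
                 stringsAsFactors = FALSE)

dt <- as.data.table(df)
# parse each group's clock times in that group's timezone
dt[, pt := as.POSIXct(datetime, tz = timezone[1]), by = timezone]
# instants in 'pt' are already correct; relabel their display tz as UTC
dt[, datetime_utc := with_tz(pt, tzone = "UTC")]
```

Here `datetime_utc` prints as "2016-03-03 01:30:00 UTC" and "2016-03-03 11:00:00 UTC".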

And while we're at it, let's do a horse race:

R> library(microbenchmark)
R> microbenchmark( useDataTable(df), useLubridate(df) )
Unit: milliseconds
             expr     min      lq    mean  median      uq      max neval cld
 useDataTable(df) 1.23148 1.53900 1.61174 1.57635 1.64734  3.85423   100  a 
 useLubridate(df) 7.51158 8.88734 9.10439 9.19390 9.38032 15.27572   100   b
R> 

So data.table is faster while also returning more useful information. Collating the third column back into a data.frame (or similar) would take a little more time.
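For completeness: newer lubridate versions (1.6.0 and later, so possibly newer than what was available when this was written) ship force_tzs(), a vectorized counterpart of the force_tz()/with_tz() pair from the question, which would also be worth benchmarking:

```r
library(lubridate)

datetime <- ymd_hms(c("2016-03-03 12:30:00", "2016-03-03 12:00:00"))
timezone <- c("Australia/Melbourne", "Europe/Berlin")

# interpret each clock time in its own timezone, return UTC instants
datetime_utc <- force_tzs(datetime, tzones = timezone, tzone_out = "UTC")
```

This avoids both the per-row lapply from the question and the per-group split, at the cost of depending on a newer lubridate.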

Answered Dec 16 '22 by Dirk Eddelbuettel