I have data like this:
library(data.table)
NN = 10000000
set.seed(32040)
DT <- data.table(
  col = 1:NN,
  timestamp = 1521872652 + sample(7000001, NN, replace = TRUE)
)
I'm trying to pull the unique year and week as a code so I can sort duplicates (the real data table has a userID along with much more). I have a current solution that works (below), but the part that pastes the week and year together from the date column is slow. Creating the date with the anytime package and pulling the week and year with lubridate are still very fast. Can someone help me speed this up? Thanks!
My slow code (works but I'd like to speed it up):
library(anytime)
library(lubridate)
tz<-"Africa/Addis_Ababa"
DT$localtime<- anytime(DT$timestamp, tz=tz) ###Lightning fast
DT$weekuni <- paste(year(DT$localtime),week(DT$localtime),sep="") ###super slow
My tests show it's the paste that's killing me:

Very fast anytime conversion to date:
system.time(DT$localtime <- anytime(DT$timestamp, tz = tz))  ### lightning fast
user system elapsed
0.264 0.417 0.933
Fast lubridate week and year conversion from date, but slow paste:
> system.time(DT$weekuni1 <- week(DT$localtime))
user system elapsed
1.203 0.188 1.400
> system.time(DT$weekuni2 <- year(DT$localtime))
user system elapsed
1.229 0.189 1.427
> system.time(DT$weekuni <- paste0(DT$weekuni2, DT$weekuni1))  ### super slow
user system elapsed
14.652 0.344 15.483
If you're willing to define a year-week based only on the date, you can get a solution that's 20 times faster:
library(data.table)
NN = 10000000
# NN = 1e4
set.seed(32040)
DT <- data.table(
  col = seq_len(NN),
  timestamp = 1521872652 + sample(7000001, NN, replace = TRUE)
)
DT1 <- copy(DT)
DT2 <- copy(DT)
tz <- "Africa/Addis_Ababa"
old <- function(DT) {
  DT$localtime <- anytime::anytime(DT$timestamp, tz = tz)  ### lightning fast
  DT$weekuni <- paste(lubridate::year(DT$localtime), lubridate::week(DT$localtime), sep = "")
  DT[, timestamp := NULL]
  DT[, .(col, localtime, weekuni)]
}
new <- function(DT) {
  DT[, localtime := anytime::anytime(timestamp, tz = tz)]
  DT[, Date := as.Date(localtime)]
  DT[, weekuni := paste0(lubridate::year(.BY[[1L]]), lubridate::week(.BY[[1L]])),
     keyby = "Date"]
  DT[, Date := NULL]
  # DT[, timestamp := NULL]
  DT[order(col), .(col, localtime, weekuni)]
}
bench::mark(old(DT1), new(DT2), check = FALSE, filter_gc = FALSE)
#> # A tibble: 2 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <bch:t> <bch:t> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int>
#> 1 old(DT1) 22.39s 22.39s 22.39s 22.39s 0.0447 2.28GB 5 1
#> 2 new(DT2) 1.13s 1.13s 1.13s 1.13s 0.888 878.12MB 1 1
#> # ... with 1 more variable: total_time <bch:tm>
Created on 2018-06-23 by the reprex package (v0.2.0).
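The key to this speedup is that paste0 runs once per unique Date under keyby, not once per row: the sample timestamps span only about 81 days. A quick way to see the group count (a sketch, reusing the DT and tz defined above):

# How many distinct local dates are in the sample -- i.e., how many times
# paste0() actually runs under keyby = "Date" (about 82 for this seed)
DT[, data.table::uniqueN(anytime::anydate(timestamp, tz = tz))]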
Even if you don't, you can still get a roughly 10-fold speedup by calling paste only once per date:
library(data.table)
NN = 1e7
# NN = 1e4
set.seed(32040)
DT <- data.table(
  col = seq_len(NN),
  timestamp = 1521872652 + sample(7000001, NN, replace = TRUE)
)
DT1 <- copy(DT)
DT2 <- copy(DT)
tz <- "Africa/Addis_Ababa"
old <- function(DT) {
  DT$localtime <- anytime::anytime(DT$timestamp, tz = tz)  ### lightning fast
  DT$weekuni <- paste(lubridate::year(DT$localtime), lubridate::week(DT$localtime), sep = "")
  DT[, timestamp := NULL]
  DT[, .(col, weekuni)]
}
new <- function(DT) {
  DT[, Date := anytime::anydate(timestamp, tz = tz)]
  DT[, weekuni := paste0(lubridate::year(.BY[[1L]]), lubridate::week(.BY[[1L]])),
     keyby = "Date"]
  DT[, Date := NULL]
  # DT[, timestamp := NULL]
  setorderv(DT[, .(col, weekuni)], "col")
}
bench::mark(old(DT1), new(DT2), check = TRUE, filter_gc = FALSE)
#> # A tibble: 2 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <bch:t> <bch:t> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int>
#> 1 old(DT1) 22.2s 22.2s 22.2s 22.2s 0.0450 2.21GB 4 1
#> 2 new(DT2) 2.8s 2.8s 2.8s 2.8s 0.357 1.42GB 3 1
#> # ... with 1 more variable: total_time <bch:tm>
I made your code run about 50% faster by using format instead of paste.

First, I'm not sure anytime is needed for your use case, since we can just throw the timestamp into a POSIXct structure almost instantly:
DT[ , localtime := .POSIXct(timestamp, tz = tz)]
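If you want to convince yourself the two conversions agree, here's a quick spot check (a sketch, reusing the DT and tz from above):

# Spot check (sketch): .POSIXct() just attaches the class and tzone
# attributes to the numeric vector, so the formatted local times should
# match anytime()'s conversion -- expect TRUE
identical(
  format(anytime::anytime(DT$timestamp[1:5], tz = tz)),
  format(.POSIXct(DT$timestamp[1:5], tz = tz))
)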
Next, I searched around on ?strptime for the ISO-week-based formatting codes to get:

DT[, weekuni := format(localtime, format = '%G%V')]
I'm not 100% sure this will always be the same as paste(year, week), but it was for your test data; if the two ever differ, ask yourself whether the difference really matters for your use case.
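One likely reason they matched here: 2018-01-01 fell on a Monday, so ISO weeks happen to line up with lubridate's day-of-year-based weeks throughout 2018. If you want to check on your own data, you can compare the two labels on the unique dates only (a sketch; note that %V zero-pads the week, so the lubridate label is padded to match):

# Compare the two week labels on unique dates only (cheap even at 1e7 rows)
chk <- data.table(d = unique(as.Date(DT$localtime, tz = tz)))
chk[, iso := format(d, '%G%V')]
chk[, lub := sprintf('%d%02d', as.integer(lubridate::year(d)),
                     as.integer(lubridate::week(d)))]
chk[iso != lub]  # zero rows for this sample; generally non-empty near year ends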
The only thing I can think of that might be faster would be integer arithmetic on the timestamp itself. That is substantially easier if the Africa/Addis_Ababa time zone makes no adjustment to its UTC offset in your sample timeframe; unfortunately, it looks like Africa/Addis_Ababa observes daylight saving time, so the UTC offset varies between 2 and 3 hours, making the integer-arithmetic approach substantially more difficult.
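For illustration, here is roughly what that integer arithmetic could look like if you pretend the offset is a constant UTC+3 (a sketch only: it ignores the offset changes just described, and it yields a running Monday-based week counter rather than a "yearweek" label, which is still enough for sorting or deduplicating by week):

# Sketch: assumes a FIXED UTC+3 offset (3 * 3600 seconds) -- NOT valid if
# the offset actually changes within the data's timeframe.
# 1970-01-01 was a Thursday, hence the +3 shift so the counter ticks on Mondays.
DT[, week_id := ((timestamp + 3L * 3600L) %/% 86400L + 3L) %/% 7L]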
For the record, using data.table::year and data.table::week is about as fast as the approach used here, but it uses a different definition of "year" and "week" than lubridate (whose isoyear and isoweek correspond to the ISO year/week that %G%V gives above).
data.table doesn't yet have an isoyear implementation, and data.table::isoweek is substantially slower than lubridate::week.
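For completeness, the data.table variant would look something like this (a sketch; the weekuni_dt name is mine, and per the definitional difference above its labels can disagree with the '%G%V' ones):

# data.table's calendar helpers, reusing the paste-once-per-date trick;
# note data.table::week() is not ISO week, so labels may differ from '%G%V'
DT[, Date := as.IDate(as.Date(localtime, tz = tz))]
DT[, weekuni_dt := paste0(data.table::year(Date), data.table::week(Date)),
   keyby = "Date"]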