Increase paste speed of two columns in data.table in R (reproducible)

I have data like this:

library(data.table)
NN = 10000000
set.seed(32040)
DT <- data.table(
  col = 1:10000000,
  timestamp = 1521872652 + sample(7000001, NN, replace = TRUE)
)

I'm trying to build a unique year-week code so I can deduplicate rows (the real data.table also has a userID column along with much more). I have a current solution that works (below), but the step that pastes the year and week together from the date column is slow. Creating the date with the anytime package and pulling the week and year with lubridate are both still very fast. Can someone help me speed this up? Thanks!
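
For context, the end goal is roughly this (a sketch only: the userID column is hypothetical and not part of the reproducible data above):

# hypothetical end goal, once weekuni exists: keep one row per user per year-week
dedup <- unique(DT, by = c("userID", "weekuni"))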

My slow code (works but I'd like to speed it up):

library(anytime)
library(lubridate)
tz <- "Africa/Addis_Ababa"
DT$localtime <- anytime(DT$timestamp, tz = tz) ###Lightning fast
DT$weekuni <- paste(year(DT$localtime), week(DT$localtime), sep = "") ###super slow

My tests show it's the paste that's killing me:

Very fast anytime conversion to date:

system.time(DT$localtime <- anytime(DT$timestamp, tz = tz)) ###Lightning fast
       user  system elapsed 
      0.264   0.417   0.933 

Fast lubridate week and year conversion from date, but slow paste:

> system.time(DT$weekuni1 <- week(DT$localtime))
   user  system elapsed 
  1.203   0.188   1.400 
> system.time(DT$weekuni2 <- year(DT$localtime))
   user  system elapsed 
  1.229   0.189   1.427 
> system.time(DT$weekuni <- paste0(DT$weekuni1, DT$weekuni2)) ###super slow
   user  system elapsed 
 14.652   0.344  15.483
asked Jan 27 '23 by Neal Barsch

2 Answers

If you're willing to define a year-week based only on the date, you can get a solution that's 20 times faster:

library(data.table)
NN = 10000000
# NN = 1e4
set.seed(32040)
DT <- data.table(
  col = seq_len(NN),
  timestamp = 1521872652 + sample(7000001, NN, replace = TRUE)
)
DT1 <- copy(DT)

DT2 <- copy(DT)
tz <- "Africa/Addis_Ababa"

old <- function(DT) {
  DT$localtime<-  anytime::anytime(DT$timestamp, tz=tz) ###Lightning fast
  DT$weekuni <- paste(lubridate::year(DT$localtime), lubridate::week(DT$localtime), sep="")
  DT[, timestamp := NULL]
  DT[, .(col, localtime, weekuni)]
}

new <- function(DT) {
  DT[ , localtime := anytime::anytime(timestamp, tz = tz)]
  DT[, Date := as.Date(localtime)]
  DT[, weekuni := paste0(lubridate::year(.BY[[1L]]), lubridate::week(.BY[[1L]])),
     keyby = "Date"]
  DT[, Date := NULL]
  # DT[, timestamp := NULL]
  DT[order(col), .(col, localtime, weekuni)]
}

bench::mark(old(DT1), new(DT2), check = FALSE, filter_gc = FALSE)
#> # A tibble: 2 x 10
#>   expression     min    mean median    max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:t> <bch:> <bch:>     <dbl> <bch:byt> <dbl> <int>
#> 1 old(DT1)    22.39s  22.39s 22.39s 22.39s    0.0447    2.28GB     5     1
#> 2 new(DT2)     1.13s   1.13s  1.13s  1.13s    0.888   878.12MB     1     1
#> # ... with 1 more variable: total_time <bch:tm>

Created on 2018-06-23 by the reprex package (v0.2.0).

Even if you don't, you can still get a roughly 10-fold speedup by calling paste only once per unique date:

library(data.table)
NN = 1e7
# NN = 1e4
set.seed(32040)
DT <- data.table(
  col = seq_len(NN),
  timestamp = 1521872652 + sample(7000001, NN, replace = TRUE)
)
DT1 <- copy(DT)

DT2 <- copy(DT)
DT3 <- copy(DT)
tz <- "Africa/Addis_Ababa"

old <- function(DT) {
  DT$localtime<-  anytime::anytime(DT$timestamp, tz=tz) ###Lightning fast
  DT$weekuni <- paste(lubridate::year(DT$localtime), lubridate::week(DT$localtime), sep="")
  DT[, timestamp := NULL]
  DT[, .(col, weekuni)]
}

new <- function(DT) {
  DT[ , Date := anytime::anydate(timestamp, tz = tz)]
  DT[, weekuni := paste0(lubridate::year(.BY[[1L]]), lubridate::week(.BY[[1L]])),
     keyby = "Date"]
  DT[, Date := NULL]
  # DT[, timestamp := NULL]
  setorderv(DT[, .(col, weekuni)], "col")
}


bench::mark(old(DT1), new(DT2), check = TRUE, filter_gc = FALSE)
#> # A tibble: 2 x 10
#>   expression     min    mean median    max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:t> <bch:> <bch:>     <dbl> <bch:byt> <dbl> <int>
#> 1 old(DT1)     22.2s   22.2s  22.2s  22.2s    0.0450    2.21GB     4     1
#> 2 new(DT2)      2.8s    2.8s   2.8s   2.8s    0.357     1.42GB     3     1
#> # ... with 1 more variable: total_time <bch:tm>
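
The speed-up in both versions comes from the keyby = "Date" grouping: paste0 runs once per unique calendar day instead of once per row. The same "compute once per date" idea can also be written with an explicit lookup table and an update join (a sketch, not part of the benchmark above):

library(data.table)
DT[, Date := anytime::anydate(timestamp, tz = tz)]
lookup <- unique(DT[, .(Date)])                       # one row per distinct day
lookup[, weekuni := paste0(lubridate::year(Date),
                           lubridate::week(Date))]    # paste once per day
DT[lookup, weekuni := i.weekuni, on = "Date"]         # join the codes back by date
DT[, Date := NULL]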
answered Jan 30 '23 by Hugh

I made your code run about 50% faster using format instead of paste.

First, I'm not sure anytime is needed for your use case, since we can wrap the timestamp in a POSIXct structure almost instantly:

DT[ , localtime := .POSIXct(timestamp, tz = tz)]

Next, I searched around on ?strptime for the ISO-week-based formatting codes to get:

DT[ , weekuni := format(localtime, format = '%G%V')]

I'm not 100% sure this will always match paste(year, week), but it did for your test data; if the two ever differ, ask yourself whether the difference actually matters for your use case.
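
If you want to check on your own data, a quick spot-comparison (a sketch, not part of the original answer) is to zero-pad lubridate's week so it is comparable with the two-digit %V code:

iso_code    <- format(DT$localtime, format = "%G%V")
padded_code <- sprintf("%d%02d",
                       as.integer(lubridate::year(DT$localtime)),
                       as.integer(lubridate::week(DT$localtime)))
mean(iso_code == padded_code)   # 1 means both definitions agree on this data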

The only other thing I can think of that might be faster would be integer arithmetic on the timestamp itself. That is substantially easier if the Africa/Addis_Ababa time zone has no adjustment to its UTC offset within your sample timeframe (unfortunately, it looks like Africa/Addis_Ababa observes daylight saving time, so the UTC offset varies between 2 and 3 hours, which makes the integer-arithmetic approach substantially harder).
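
Just to illustrate the idea: if the offset really were fixed for the whole sample window (say a constant UTC+3, which is an assumption and, per the note above, not actually the case here), a sketch of the integer-arithmetic approach could be:

# SKETCH ONLY: assumes a constant UTC+3 offset for every timestamp
offset    <- 3L * 3600L
local_day <- (DT$timestamp + offset) %/% 86400L           # local days since 1970-01-01
DT[, Date := as.Date(local_day, origin = "1970-01-01")]   # no POSIXct/timezone conversion
DT[, weekuni := format(Date, "%G%V"), keyby = "Date"]     # format once per unique day
DT[, Date := NULL]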


For the record, using data.table::year and data.table::week is about as fast as the approach used here, but it uses a different definition of "year" and "week" than lubridate (which by default uses the ISO year/week that %G%V does above).

data.table doesn't yet have an isoyear implementation, and data.table::isoweek is substantially slower than lubridate::week.
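
If you want to see how the definitions differ on concrete dates, a quick interactive comparison (a sketch) is:

d <- as.Date(c("2018-01-01", "2018-12-31", "2019-01-01"))
# each pair prints the value under a different package's definition
data.table::week(d);    lubridate::week(d)
data.table::isoweek(d); lubridate::isoweek(d)
data.table::year(d);    lubridate::isoyear(d)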

answered Jan 30 '23 by MichaelChirico