
Fastest way of joining 2 data tables while summarising the data in one of them

Tags: join, r, data.table

Problem

I have two data tables.

Movement:

library(data.table)

consEx = data.table(
  begin = as.POSIXct(c("2019-04-01 00:00:10","2019-04-07 10:00:00","2019-04-10 23:00:00","2019-04-12 20:00:00","2019-04-15 10:00:00",
                       "2019-04-20 10:00:00","2019-04-22 13:30:00","2019-04-10 15:30:00","2019-04-12 21:30:00","2019-04-15 20:00:00")),

  end = as.POSIXct(c("2019-04-01 20:00:00","2019-04-07 15:00:00","2019-04-11 10:00:00","2019-04-12 23:30:00","2019-04-15 15:00:00",
                     "2019-04-21 12:00:00","2019-04-22 17:30:00","2019-04-10 20:00:00","2019-04-13 05:00:00","2019-04-15 12:30:00")),

  carId = c(1,1,1,2,2,3,3,4,4,5),
  tripId = c(1:10)
)

And Alerts:

alertsEx = data.table(
  timestamp = as.POSIXct(c("2019-04-01 10:00:00","2019-04-01 10:30:00","2019-04-01 15:00:00","2019-04-15 13:00:00","2019-04-22 14:00:00",
                "2019-04-22 15:10:00","2019-04-22 15:40:00","2019-04-10 16:00:00","2019-04-10 17:00:00","2019-04-13 04:00:00")),
  type = c("T1","T2","T1","T3","T1","T1","T3","T2","T2","T1"),
  carId = c(1,1,1,2,3,3,3,4,4,4),
  additionalInfo1 = rnorm(10,mean=10,sd=4)
)

The movement table records a period (begin to end) during which a car was moving. The alerts table shows when an alert happened to a car, containing its type, timestamp and carId.

I need to join those two tables, summarising the alerts data by type. When there are multiple alerts of the same type during the same period, I need to aggregate additionalInfo1 by taking its mean.
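To make the aggregation concrete, here is a minimal sketch of the per-type summary for a single trip (car 1, trip 1 of the sample data); the hard part is doing this efficiently for every row of consEx:

# Per-type count and mean of additionalInfo1 for the alerts that fall
# inside one trip's window (car 1, trip 1 of the sample data above)
alertsEx[carId == 1 &
           timestamp >= as.POSIXct("2019-04-01 00:00:10") &
           timestamp <= as.POSIXct("2019-04-01 20:00:00"),
         .(count = .N, meanInfo = mean(additionalInfo1)),
         by = type]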

Current Approach

I'm currently doing this by looping through consEx and applying a function to each row that returns a list with the calculations I need:

findAlerts = function(begin,end,carId_2){
  # subset the alerts falling inside this trip's window for this car
  saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2,]
  totals = nrow(saida)

  # split the matching alerts into one data.table per alert type
  saida = split(saida,by="type")

  resultsList = list(
    "totals" = 0,
    "t1Count" = 0,
    "t1Mean" = 0,
    "t2Count" = 0,
    "t2Mean" = 0,
    "t3Count" = 0,
    "t3Mean" = 0)
  resultsList[["totals"]] = totals

  types = names(saida)
  if("T1" %in% types){
    resultsList[["t1Count"]] = nrow(saida[["T1"]])
    resultsList[["t1Mean"]] = mean(saida[["T1"]]$additionalInfo1)
  }

  if("T2" %in% types){
    resultsList[["t2Count"]] = nrow(saida[["T2"]])
    resultsList[["t2Mean"]] = mean(saida[["T2"]]$additionalInfo1)
  }

  if("T3" %in% types){
    resultsList[["t3Count"]] = nrow(saida[["T3"]])
    resultsList[["t3Mean"]] = mean(saida[["T3"]]$additionalInfo1)
  }

  return(resultsList)
}

for(i in 1:nrow(consEx)){
  aux = findAlerts(consEx$begin[i],consEx$end[i],consEx$carId[i])
  consEx[i,names(aux) := aux]
}

Expected Output

Which gives the expected output:

                 begin                 end carId tripId totals t1Count    t1Mean t2Count    t2Mean t3Count    t3Mean
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3       2 10.123463       1 14.479288       0  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0       0  0.000000       0  0.000000       0  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0       0  0.000000       0  0.000000       0  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0       0  0.000000       0  0.000000       0  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1       0  0.000000       0  0.000000       1  6.598062
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0       0  0.000000       0  0.000000       0  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3       2  7.610410       0  0.000000       1 10.218593
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2       0  0.000000       2  9.703278       0  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1       1  7.095564       0  0.000000       0  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0       0  0.000000       0  0.000000       0  0.000000

But this is too slow for my original data, which has 13M rows in consEx and 2M in alertsEx. There are also several different types, for which I need to calculate the mean, minimum and maximum of 2 metrics.

Is there a faster way to write the function or the loop?

Additionally, would writing virtually the same code in Python give a better result? I'm considering rewriting it, but it would take a long time and I'm not sure it would solve the issue.

Thanks

PS: Another approach I tried was looping through the alerts, assigning each of them a tripId from consEx, and then aggregating the resulting table, but this seems to be even slower.
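For reference, the tagging step of that approach can be written without an explicit loop as a single non-equi update join; this is only a sketch, and the answers below build on the same idea:

# Sketch: tag each alert with the tripId of the trip whose window contains it
alertsEx[consEx, tripId := i.tripId,
         on = .(carId, timestamp >= begin, timestamp <= end)]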

Asked Dec 03 '22 by Fino


2 Answers

Here's my attempt. Let me know if you think it might be right, or if it's fast enough.

EDIT: now that table.express v0.2.0 is out, you could express the calculation as follows:

library(data.table)
library(table.express)  # v0.2.0 or later; also provides the %>% pipe

types <- alertsEx[, unique(type)]

aggregated <- consEx %>%
    start_expr %>%
    right_join(alertsEx, carId, begin <= timestamp, end >= timestamp) %>%
    select(carId, tripId, type, additionalInfo1) %>%
    chain %>%
    group_by(carId, tripId, type) %>%
    transmute(typeCount = .N, typeMean = mean(additionalInfo1)) %>%
    group_by(carId, tripId) %>%
    mutate(totals = sum(typeCount)) %>%
    end_expr

wide <- data.table::dcast(
    aggregated,
    ... ~ type,
    value.var = c("typeCount", "typeMean"),
    sep = "",
    fill = 0
)

# as.character removes dimensions
sd_cols <- as.character(outer(types, c("typeCount", "typeMean"), function(a, b) { paste0(b, a) }))
names(sd_cols) <- as.character(outer(types, c("Count", "Mean"), paste0))
sd_cols <- c(sd_cols, totals = "totals")

consEx %>%
    start_expr %>%
    mutate_join(wide, carId, tripId, .SDcols = sd_cols) %>%
    end_expr %>%
    setnafill(fill = 0) %>%
    setcolorder(c("carId", "tripId", "begin", "end"))

Original answer:

library(data.table)

set.seed(322902L)

consEx = data.table(
  begin = as.POSIXct(c("2019-04-01 00:00:10","2019-04-07 10:00:00","2019-04-10 23:00:00","2019-04-12 20:00:00","2019-04-15 10:00:00",
                       "2019-04-20 10:00:00","2019-04-22 13:30:00","2019-04-10 15:30:00","2019-04-12 21:30:00","2019-04-15 20:00:00")),

  end = as.POSIXct(c("2019-04-01 20:00:00","2019-04-07 15:00:00","2019-04-11 10:00:00", "2019-04-12 23:30:00","2019-04-15 15:00:00",
                     "2019-04-21 12:00:00","2019-04-22 17:30:00","2019-04-10 20:00:00","2019-04-13 05:00:00", "2019-04-15 12:30:00")),

  carId = c(1,1,1,2,2,3,3,4,4,5),
  tripId = c(1:10)
)

alertsEx = data.table(
  timestamp = as.POSIXct(c("2019-04-01 10:00:00","2019-04-01 10:30:00","2019-04-01 15:00:00","2019-04-15 13:00:00","2019-04-22 14:00:00",
                           "2019-04-22 15:10:00","2019-04-22 15:40:00","2019-04-10 16:00:00","2019-04-10 17:00:00","2019-04-13 04:00:00")),
  type = c("T1","T2","T1","T3","T1","T1","T3","T2","T2","T1"),
  carId = c(1,1,1,2,3,3,3,4,4,4),
  additionalInfo1 = rnorm(10,mean=10,sd=4)
)

types <- unique(alertsEx$type)

joined <- consEx[alertsEx,
                 .(carId, tripId, type, additionalInfo1),
                 on = .(carId, begin <= timestamp, end >= timestamp)]

aggregated <- joined[, .(typeCount = .N, typeMean = mean(additionalInfo1)), by = .(carId, tripId, type)]

totals <- aggregated[, .(totals = sum(typeCount)), by = .(carId, tripId)]

wide <- dcast(aggregated, carId + tripId ~ type, value.var = c("typeCount", "typeMean"), sep = "", fill = 0)

replaceNA <- function(x) { replace(x, is.na(x), 0) }

consEx[, `:=`(as.character(outer(types, c("Count", "Mean"), paste0)),
              lapply(wide[consEx,
                          as.character(outer(types, c("typeCount", "typeMean"),
                                             function(a, b) { paste0(b, a) })),
                          with = FALSE,
                          on = .(carId, tripId)],
                     replaceNA))]

consEx[, totals := lapply(totals[consEx, x.totals, on = .(carId, tripId)], replaceNA)]

setcolorder(consEx, c("carId", "tripId", "begin", "end"))
print(consEx)
    carId tripId               begin                 end T1Count T2Count T3Count    T1Mean    T2Mean    T3Mean totals
 1:     1      1 2019-04-01 00:00:10 2019-04-01 20:00:00       2       1       0 10.458406  9.170923  0.000000      3
 2:     1      2 2019-04-07 10:00:00 2019-04-07 15:00:00       0       0       0  0.000000  0.000000  0.000000      0
 3:     1      3 2019-04-10 23:00:00 2019-04-11 10:00:00       0       0       0  0.000000  0.000000  0.000000      0
 4:     2      4 2019-04-12 20:00:00 2019-04-12 23:30:00       0       0       0  0.000000  0.000000  0.000000      0
 5:     2      5 2019-04-15 10:00:00 2019-04-15 15:00:00       0       0       1  0.000000  0.000000 12.694035      1
 6:     3      6 2019-04-20 10:00:00 2019-04-21 12:00:00       0       0       0  0.000000  0.000000  0.000000      0
 7:     3      7 2019-04-22 13:30:00 2019-04-22 17:30:00       2       0       1 10.998666  0.000000  6.812388      3
 8:     4      8 2019-04-10 15:30:00 2019-04-10 20:00:00       0       2       0  0.000000 11.600738  0.000000      2
 9:     4      9 2019-04-12 21:30:00 2019-04-13 05:00:00       1       0       0  6.369317  0.000000  0.000000      1
10:     5     10 2019-04-15 20:00:00 2019-04-15 12:30:00       0       0       0  0.000000  0.000000  0.000000      0
Answered Dec 11 '22 by Alexis


This answer suggests two data.table approaches which both use non-equi joins for matching alerts to trips and dcast() for reshaping, but which differ in the way the aggregates are combined with consEx.

The first variant creates a new result dataset, leaving consEx unchanged.

The second variant modifies consEx in place. This is similar to Alexis' solution but more concise, and it uses the new setnafill() function (available with the development version 1.12.3 of data.table).

Variant 1: Create new result dataset

agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp), 
              .(tripId, type, additionalInfo1)][
                , .(Count = .N, Mean = mean(additionalInfo1)), 
                by = .(tripId, type)][
                  , totals := sum(Count), by = tripId][] 
result <- dcast(agg[consEx, on = "tripId"], ... ~ type, 
                value.var = c("Count", "Mean"), fill = 0)[
                  , c("Count_NA", "Mean_NA") := NULL][
                    is.na(totals), totals := 0]
setcolorder(result, names(consEx))
result
                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000

Note that the means are likely to differ between the different answers posted, as additionalInfo1 is created by rnorm(10, mean = 10, sd = 4) without a call to set.seed() beforehand. Calling set.seed() creates reproducible random numbers.
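For example, a minimal sketch that makes additionalInfo1 reproducible (reusing the seed from the answer above; any fixed value works):

# Seed the RNG before sampling so additionalInfo1 is identical across runs
set.seed(322902L)
alertsEx[, additionalInfo1 := rnorm(.N, mean = 10, sd = 4)]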

Explanation of variant 1

The first part finds the matching trips for each alert by a non-equi join and computes the aggregates by tripId and type as well as the total count for each trip:

agg
   tripId type Count      Mean totals
1:      1   T1     2 12.654609      3
2:      1   T2     1 12.375862      3
3:      5   T3     1  9.316815      1
4:      7   T1     2  8.586061      3
5:      7   T3     1 11.498512      3
6:      8   T2     2  8.696356      2
7:      9   T1     1  9.343681      1

Note that tripId is assumed to be a unique key in consEx.
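This assumption can be verified cheaply before relying on it, e.g. with data.table's uniqueN():

# Sanity check: tripId must identify each row of consEx uniquely
stopifnot(uniqueN(consEx$tripId) == nrow(consEx))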

In the next step, the aggregations are right joined to consEx:

agg[consEx, on = "tripId"]
    tripId type Count      Mean totals               begin                 end carId
 1:      1   T1     2 12.654609      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 2:      1   T2     1 12.375862      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 3:      2 <NA>    NA        NA     NA 2019-04-07 10:00:00 2019-04-07 15:00:00     1
 4:      3 <NA>    NA        NA     NA 2019-04-10 23:00:00 2019-04-11 10:00:00     1
 5:      4 <NA>    NA        NA     NA 2019-04-12 20:00:00 2019-04-12 23:30:00     2
 6:      5   T3     1  9.316815      1 2019-04-15 10:00:00 2019-04-15 15:00:00     2
 7:      6 <NA>    NA        NA     NA 2019-04-20 10:00:00 2019-04-21 12:00:00     3
 8:      7   T1     2  8.586061      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
 9:      7   T3     1 11.498512      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
10:      8   T2     2  8.696356      2 2019-04-10 15:30:00 2019-04-10 20:00:00     4
11:      9   T1     1  9.343681      1 2019-04-12 21:30:00 2019-04-13 05:00:00     4
12:     10 <NA>    NA        NA     NA 2019-04-15 20:00:00 2019-04-15 12:30:00     5

The output is immediately reshaped from long to wide format using dcast().

Finally, the result is cleaned up by removing superfluous columns, replacing NA by 0, and changing the column order.

Variant 2: Update in place

agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp), 
              .(tripId, type, additionalInfo1)][
                , .(Count = .N, Mean = mean(additionalInfo1)), 
                by = .(tripId, type)][
                  , totals := sum(Count), by = tripId][] 
wide <- dcast(agg, ... ~ type, value.var = c("Count", "Mean"), fill = 0)
consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
setnafill(consEx, fill = 0) # data.table version 1.12.3+ 

consEx
                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000

Explanation of variant 2

The first part is the same as in variant 1.

In the next step, the aggregates are reshaped from long to wide format.

Then, an update join is performed where consEx is updated in place by the matching rows of wide. The expression

(names(wide)) := mget(paste0("i.", names(wide))) 

is a shortcut for

`:=`(totals = i.totals, Count_T1 = i.Count_T1, ..., Mean_T3 = i.Mean_T3)

where the i. prefix refers to columns of wide and is used to avoid ambiguities.

In rows of consEx where there is no match, the appended columns contain NA. So, finally, setnafill() is used to replace all NA values with 0.
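A minimal sketch of what setnafill() does (assuming data.table 1.12.3 or later):

# setnafill() replaces NAs in numeric columns by reference, without copying
dt <- data.table(a = c(1, NA), b = c(NA, 2))
setnafill(dt, fill = 0)
dt
   a b
1: 1 0
2: 0 2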

Benchmark

At the time of writing, 6 different solutions were posted by

  • Fino, the OP, in base R
  • Dave2e, using dplyr and tidyr
  • Alexandre B, in Python
  • Alexis, using data.table
  • and myself, with 2 data.table variants.

The 5 R solutions are compared using press() and mark() from the bench package.

The sample datasets are parameterized so that the number of rows can be varied. As consEx is modified by some of the solutions, each benchmark run starts with a fresh copy.

Some minor changes were required to make the code work with bench. Most notably, the shortcut formula ... ~ type in dcast() caused an error when called within bench::mark() and had to be replaced by a version where all variables of the LHS are explicitly named.
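For example, this is the explicit version used for uwe2 in the benchmark code below:

# Shortcut formula which errors when evaluated inside bench::mark():
#   dcast(agg, ... ~ type, value.var = c("Count", "Mean"), fill = 0)
# Replacement with the LHS variables spelled out:
wide <- dcast(agg, tripId + totals ~ type, value.var = c("Count", "Mean"), fill = 0)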

As the OP has pointed out that his version is rather slow, the first benchmark run only uses 1000 rows of consEx and 50, 200, and 1000 rows of alertsEx:

library(bench)
library(magrittr)  # %>% is used when constructing the sample data below

bm <- bench::press(
  n_trips = 1E3,
  n_alerts = 1E3/c(20, 5, 1),
  {
    n_cars <- 5
    n_periods <- round(n_trips / n_cars)
    n_types = 3
    types = sprintf("T%02i", 1:n_types)
    consEx0 <- data.table(
      tripId = 1:n_trips, 
      carId = rep(1:n_cars, each = n_periods),
      begin = seq.POSIXt(as.POSIXct("2000-01-01"), length.out = n_periods, by = "14 days") %>% 
        rep(n_cars) %>% 
        `mode<-`("double")
    )[, end := begin + 7*24*60*60]
    set.seed(1)
    idx <- sample(n_trips, n_alerts, TRUE)
    alertsEx <- consEx0[
      idx, 
      .(timestamp = begin + runif(n_alerts, 1, 7*24*60*60 - 1),
        type = sample(types, n_alerts, TRUE), 
        carId,
        additionalInfo1 = rnorm(n_alerts, mean = 10, sd = 4))]
    
    bench::mark(
      fino = {
        consEx <- copy(consEx0)
        findAlerts = function(begin,end,carId_2){
          saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2,]
          totals = nrow(saida)
          saida = split(saida,by="type")
          resultsList = list(
            "totals" = 0,
            "t1Count" = 0,
            "t1Mean" = 0,
            "t2Count" = 0,
            "t2Mean" = 0,
            "t3Count" = 0,
            "t3Mean" = 0)
          resultsList[["totals"]] = totals
          types = names(saida)
          if("T1" %in% types){
            resultsList[["t1Count"]] = nrow(saida[["T1"]])
            resultsList[["t1Mean"]] = mean(saida[["T1"]]$additionalInfo1)
          }
          if("T2" %in% types){
            resultsList[["t2Count"]] = nrow(saida[["T2"]])
            resultsList[["t2Mean"]] = mean(saida[["T2"]]$additionalInfo1)
          }
          if("T3" %in% types){
            resultsList[["t3Count"]] = nrow(saida[["T3"]])
            resultsList[["t3Mean"]] = mean(saida[["T3"]]$additionalInfo1)
          }
          return(resultsList)
        }
        for(i in 1:nrow(consEx)){
          aux = findAlerts(consEx$begin[i],consEx$end[i],consEx$carId[i])
          consEx[i,names(aux) := aux]
        }
        consEx[]
      },
      dave = {
        # library(dplyr)
        # library(tidyr)
        
        consEx <- copy(consEx0)
        #split the consEx down to separate carId
        splitcons <- split(consEx, consEx$carId)
        
        alertsEx$tripId <- NA
        #loop and only search the cons that match the alerts
        alertsEx$tripId <- apply(alertsEx, 1, function(a) {
          #retrieve the list associated with the correct car
          cons <- splitcons[[a[["carId"]]]]
          alerttime <- as.POSIXct(a[["timestamp"]])
          #find the trip which contains the alerttime
          tripId <-
            which((alerttime >= cons$begin) & (alerttime <= cons$end))
          #identify and return the tripId
          cons$tripId[tripId]
        })
        
        #Referenced:
        #https://stackoverflow.com/questions/30592094/r-spreading-multiple-columns-with-tidyr
        #https://stackoverflow.com/questions/35321497/spread-multiple-columns-tidyr
        alertsEx2 <-
          alertsEx %>% 
          dplyr::group_by(carId, tripId, type) %>% 
          dplyr::summarize(mean = mean(additionalInfo1), count = dplyr::n())
        alerttable <-
          tidyr::gather(alertsEx2, variable, val, -(carId:type), na.rm = TRUE) %>%
          tidyr::unite(temp, type, variable) %>%
          tidyr::spread(temp, val, fill = 0)
        consEx %>% 
          dplyr::left_join(alerttable, by = c("tripId", "carId")) 
        
      },
      alexis = {
        consEx <- copy(consEx0)
        joined <- consEx[alertsEx,
                         .(carId, tripId, type, additionalInfo1),
                         on = .(carId, begin <= timestamp, end >= timestamp)]
        aggregated <- joined[, .(typeCount = .N, typeMean = mean(additionalInfo1)), by = .(carId, tripId, type)]
        totals <- aggregated[, .(totals = sum(typeCount)), by = .(carId, tripId)]
        long <- dcast(aggregated, carId + tripId ~ type, value.var = c("typeCount", "typeMean"), sep = "", fill = 0)
        replaceNA <- function(x) { replace(x, is.na(x), 0) }
        consEx[, `:=`(as.character(outer(types, c("Count", "Mean"), paste0)),
                      lapply(long[consEx,
                                  as.character(outer(types, c("typeCount", "typeMean"),
                                                     function(a, b) { paste0(b, a) })),
                                  with = FALSE,
                                  on = .(carId, tripId)],
                             replaceNA))]
        consEx[, totals := lapply(totals[consEx, x.totals, on = .(carId, tripId)], replaceNA)]
        setcolorder(consEx, c("carId", "tripId", "begin", "end"))
        consEx
      },
      uwe1 = {
        consEx <- copy(consEx0)
        agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp), 
                      .(tripId, type, additionalInfo1)][
                        , .(Count = .N, Mean = mean(additionalInfo1)), 
                        by = .(tripId, type)][
                          , totals := sum(Count), by = tripId][] 
        result <- dcast(agg[consEx, on = "tripId"], 
                        tripId + carId + begin+ end + totals ~ type, 
                        value.var = c("Count", "Mean"), fill = 0)[
                          , c("Count_NA", "Mean_NA") := NULL][
                            is.na(totals), totals := 0][]
        # setcolorder(result, names(consEx))
        result
      },
      uwe2 = {
        consEx <- copy(consEx0)
        agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp), 
                      .(tripId, type, additionalInfo1)][
                        , .(Count = .N, Mean = mean(additionalInfo1)), 
                        by = .(tripId, type)][
                          , totals := sum(Count), by = tripId][] 
        wide <- dcast(agg, tripId + totals ~ type, value.var = c("Count", "Mean"), fill = 0)
        consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
        setnafill(consEx, fill = 0) # data.table version 1.12.3+ 
        consEx
      },
      check = FALSE
    )
  }
)

library(ggplot2)
autoplot(bm)

[Chart: benchmark timings for the first run, logarithmic time scale]

Please note the logarithmic time scale.

The chart confirms the OP's assumption that his approach is orders of magnitude slower than any of the other solutions: it took 3 seconds for 1000 rows, while the OP's production dataset has 13M rows. The run time of Dave2e's approach increases with the number of alerts.

Also, the OP's approach is by far the most demanding in terms of allocated memory: it allocates up to 190 MB, while the data.table versions require only about 1 MB.

bm
# A tibble: 15 x 15
   expression n_trips n_alerts      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
   <bch:expr>   <dbl>    <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
 1 fino          1000       50    2.95s    2.95s     0.339  149.25MB     3.39     1    10      2.95s <data~ <Rpro~ <bch~
 2 dave          1000       50  21.52ms  22.61ms    41.6    779.46KB     5.94    21     3   505.01ms <data~ <Rpro~ <bch~
 3 alexis        1000       50  21.98ms  23.09ms    41.2   1007.84KB     3.93    21     2    509.5ms <data~ <Rpro~ <bch~
 4 uwe1          1000       50  14.47ms  15.45ms    62.3    919.12KB     1.95    32     1   513.94ms <data~ <Rpro~ <bch~
 5 uwe2          1000       50  14.14ms  14.73ms    64.2     633.2KB     1.95    33     1   513.91ms <data~ <Rpro~ <bch~
 6 fino          1000      200    3.09s    3.09s     0.323  155.12MB     4.53     1    14      3.09s <data~ <Rpro~ <bch~
 7 dave          1000      200   53.4ms  62.11ms    11.0       1.7MB     5.49     8     4   728.02ms <data~ <Rpro~ <bch~
 8 alexis        1000      200  22.42ms  23.34ms    41.5      1.03MB     3.77    22     2   530.25ms <data~ <Rpro~ <bch~
 9 uwe1          1000      200  14.72ms  15.76ms    62.2    953.07KB     0       32     0   514.46ms <data~ <Rpro~ <bch~
10 uwe2          1000      200  14.49ms  15.17ms    63.6    695.63KB     1.99    32     1    503.4ms <data~ <Rpro~ <bch~
11 fino          1000     1000     3.6s     3.6s     0.278  187.32MB     3.61     1    13       3.6s <data~ <Rpro~ <bch~
12 dave          1000     1000 242.46ms  243.7ms     4.07      6.5MB     6.78     3     5   737.06ms <data~ <Rpro~ <bch~
13 alexis        1000     1000  24.32ms  25.78ms    37.6      1.21MB     3.95    19     2   505.84ms <data~ <Rpro~ <bch~
14 uwe1          1000     1000  16.04ms   16.8ms    56.8      1.05MB     1.96    29     1   510.23ms <data~ <Rpro~ <bch~
15 uwe2          1000     1000  15.69ms  16.41ms    54.8    938.74KB     3.92    28     2   510.63ms <data~ <Rpro~ <bch~

So, Fino's approach is omitted from the second benchmark run, which uses larger problem sizes of 10k and 100k rows for consEx and 2k and 20k rows for alertsEx, respectively.

[Chart: benchmark timings for the second run, logarithmic time scale]

As Dave2e's approach is about 100 times slower than the fastest approach for 20k rows, it is omitted from the third run. This run simulates the OP's production dataset of 13M rows in consEx and 2M rows in alertsEx:

[Chart: benchmark timings for the third run, logarithmic time scale]

print(bm, n = Inf)
# A tibble: 27 x 15
   expression n_trips n_alerts      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
   <bch:expr>   <dbl>    <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
 1 alexis      1.00e5    20000 747.87ms 747.87ms  1.34       38.18MB  6.69        1     5   747.87ms <data~ <Rpro~ <bch~
 2 uwe1        1.00e5    20000 124.52ms 138.89ms  7.32       37.88MB  3.66        4     2   546.17ms <data~ <Rpro~ <bch~
 3 uwe2        1.00e5    20000  72.29ms  73.76ms 11.3        16.63MB  1.88        6     1   533.26ms <data~ <Rpro~ <bch~
 4 alexis      1.00e6    20000    7.51s    7.51s  0.133     335.44MB  3.33        1    25      7.51s <data~ <Rpro~ <bch~
 5 uwe1        1.00e6    20000 995.31ms 995.31ms  1.00      346.94MB  2.01        1     2   995.31ms <data~ <Rpro~ <bch~
 6 uwe2        1.00e6    20000 714.19ms 714.19ms  1.40       89.27MB  1.40        1     1   714.19ms <data~ <Rpro~ <bch~
 7 alexis      1.00e7    20000    1.67m    1.67m  0.00999     3.21GB  0.340       1    34      1.67m <data~ <Rpro~ <bch~
 8 uwe1        1.00e7    20000    3.44m    3.44m  0.00485     3.41GB  0.00969     1     2      3.44m <data~ <Rpro~ <bch~
 9 uwe2        1.00e7    20000   42.51s   42.51s  0.0235     810.3MB  0           1     0     42.51s <data~ <Rpro~ <bch~
10 alexis      1.00e5   200000   15.38s   15.38s  0.0650     73.22MB  0.0650      1     1     15.38s <data~ <Rpro~ <bch~
11 uwe1        1.00e5   200000    1.34s    1.34s  0.747      63.81MB  0           1     0      1.34s <data~ <Rpro~ <bch~
12 uwe2        1.00e5   200000    1.47s    1.47s  0.681      58.98MB  0           1     0      1.47s <data~ <Rpro~ <bch~
13 alexis      1.00e6   200000    2.91m    2.91m  0.00573   375.93MB  0.0115      1     2      2.91m <data~ <Rpro~ <bch~
14 uwe1        1.00e6   200000    9.72s    9.72s  0.103     371.69MB  0           1     0      9.72s <data~ <Rpro~ <bch~
15 uwe2        1.00e6   200000 888.67ms 888.67ms  1.13      161.82MB  0           1     0   888.67ms <data~ <Rpro~ <bch~
16 alexis      1.00e7   200000    6.29m    6.29m  0.00265     3.15GB  0.0928      1    35      6.29m <data~ <Rpro~ <bch~
17 uwe1        1.00e7   200000    2.45m    2.45m  0.00681     3.43GB  0.0136      1     2      2.45m <data~ <Rpro~ <bch~
18 uwe2        1.00e7   200000   12.48m   12.48m  0.00134   887.99MB  0.00134     1     1     12.48m <data~ <Rpro~ <bch~
19 alexis      1.00e5  2000000    3.04s    3.04s  0.329        207MB  0           1     0      3.04s <data~ <Rpro~ <bch~
20 uwe1        1.00e5  2000000    2.96s    2.96s  0.338     196.42MB  0           1     0      2.96s <data~ <Rpro~ <bch~
21 uwe2        1.00e5  2000000    2.81s    2.81s  0.355     187.79MB  0           1     0      2.81s <data~ <Rpro~ <bch~
22 alexis      1.00e6  2000000    6.96m    6.96m  0.00239   726.14MB  0.00479     1     2      6.96m <data~ <Rpro~ <bch~
23 uwe1        1.00e6  2000000    2.01m    2.01m  0.00827    631.1MB  0           1     0      2.01m <data~ <Rpro~ <bch~
24 uwe2        1.00e6  2000000   30.54s   30.54s  0.0327    584.81MB  0           1     0     30.54s <data~ <Rpro~ <bch~
25 alexis      1.00e7  2000000   31.54m   31.54m  0.000528    3.66GB  0.0127      1    24     31.54m <data~ <Rpro~ <bch~
26 uwe1        1.00e7  2000000    8.72m    8.72m  0.00191     3.67GB  0           1     0      8.72m <data~ <Rpro~ <bch~
27 uwe2        1.00e7  2000000   12.35m   12.35m  0.00135     1.58GB  0           1     0     12.35m <data~ <Rpro~ <bch~

Caveat: Please note that the run times for most of the parameter variations are only confirmed by a single measurement. In addition, garbage collection (gc) may have an influence due to the limited amount of RAM on my computer (8 GB). So, your mileage may vary.

The timings are not consistent. In general, uwe2 (in-place update) is substantially faster than uwe1 (new result set), which confirms jangorecki's assumption. alexis is almost always the slowest. (Please note the logarithmic time scale in the chart, or compare the timings in the table.)

However, some parameter variations deviate from this pattern. For 10M trips, uwe1 is faster than uwe2 in two cases, and alexis also catches up. These effects might be due to garbage collection.

Memory allocation is much lower for the in-place update approach (even if this does not always pay off in terms of speed).

Again, interpret the results with care. The third benchmark run should be repeated on a larger machine and with more patience to allow for repeated measurements.

Answered Dec 11 '22 by Uwe