I have 2 data tables,
Movement:
library(data.table)
consEx = data.table(
begin = as.POSIXct(c("2019-04-01 00:00:10"," 2019-04-07 10:00:00","2019-04-10 23:00:00","2019-04-12 20:00:00","2019-04-15 10:00:00",
"2019-04-20 10:00:00","2019-04-22 13:30:00","2019-04-10 15:30:00","2019-04-12 21:30:00","2019-04-15 20:00:00")),
end = as.POSIXct(c("2019-04-01 20:00:00","2019-04-07 15:00:00","2019-04-11 10:00:00", "2019-04-12 23:30:00","2019-04-15 15:00:00",
"2019-04-21 12:00:00","2019-04-22 17:30:00","2019-04-10 20:00:00","2019-04-13 05:00:00", "2019-04-15 12:30:00")),
carId = c(1,1,1,2,2,3,3,4,4,5),
tripId = c(1:10)
)
And Alerts:
alertsEx = data.table(
timestamp = as.POSIXct(c("2019-04-01 10:00:00","2019-04-01 10:30:00","2019-04-01 15:00:00","2019-04-15 13:00:00","2019-04-22 14:00:00",
"2019-04-22 15:10:00","2019-04-22 15:40:00","2019-04-10 16:00:00","2019-04-10 17:00:00","2019-04-13 04:00:00")),
type = c("T1","T2","T1",'T3',"T1","T1","T3","T2","T2","T1"),
carId = c(1,1,1,2,3,3,3,4,4,4),
additionalInfo1 = rnorm(10,mean=10,sd=4)
)
The movement table records a period (begin to end) during which a car was moving. The alerts table shows when an alert happened to a car, containing the type, timestamp and carId.
I need to join those 2 tables, summarising the alerts data by type. When there are multiple alerts of the same type during the same period, I need to aggregate additionalInfo1 by taking its mean.
I'm currently doing this by looping through consEx and applying a function to each row that returns a list with the calculations I need:
findAlerts = function(begin,end,carId_2){
  saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2,]
  totals = nrow(saida)
  saida = split(saida,by="type")
  resultsList = list(
    "totals" = 0,
    "t1Count" = 0,
    "t1Mean" = 0,
    "t2Count" = 0,
    "t2Mean" = 0,
    "t3Count" = 0,
    "t3Mean" = 0)
  resultsList[["totals"]] = totals
  types = names(saida)
  if("T1" %in% types){
    resultsList[["t1Count"]] = nrow(saida[["T1"]])
    resultsList[["t1Mean"]] = mean(saida[["T1"]]$additionalInfo1)
  }
  if("T2" %in% types){
    resultsList[["t2Count"]] = nrow(saida[["T2"]])
    resultsList[["t2Mean"]] = mean(saida[["T2"]]$additionalInfo1)
  }
  if("T3" %in% types){
    resultsList[["t3Count"]] = nrow(saida[["T3"]])
    resultsList[["t3Mean"]] = mean(saida[["T3"]]$additionalInfo1)
  }
  return(resultsList)
}

for(i in 1:nrow(consEx)){
  aux = findAlerts(consEx$begin[i],consEx$end[i],consEx$carId[i])
  consEx[i,names(aux) := aux]
}
Which gives the expected output:
begin end carId tripId totals t1Count t1Mean t2Count t2Mean t3Count t3Mean
1: 2019-04-01 00:00:10 2019-04-01 20:00:00 1 1 3 2 10.123463 1 14.479288 0 0.000000
2: 2019-04-07 10:00:00 2019-04-07 15:00:00 1 2 0 0 0.000000 0 0.000000 0 0.000000
3: 2019-04-10 23:00:00 2019-04-11 10:00:00 1 3 0 0 0.000000 0 0.000000 0 0.000000
4: 2019-04-12 20:00:00 2019-04-12 23:30:00 2 4 0 0 0.000000 0 0.000000 0 0.000000
5: 2019-04-15 10:00:00 2019-04-15 15:00:00 2 5 1 0 0.000000 0 0.000000 1 6.598062
6: 2019-04-20 10:00:00 2019-04-21 12:00:00 3 6 0 0 0.000000 0 0.000000 0 0.000000
7: 2019-04-22 13:30:00 2019-04-22 17:30:00 3 7 3 2 7.610410 0 0.000000 1 10.218593
8: 2019-04-10 15:30:00 2019-04-10 20:00:00 4 8 2 0 0.000000 2 9.703278 0 0.000000
9: 2019-04-12 21:30:00 2019-04-13 05:00:00 4 9 1 1 7.095564 0 0.000000 0 0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00 5 10 0 0 0.000000 0 0.000000 0 0.000000
But this is too slow for my original data, which has 13M rows in consEx and 2M in alerts. There are also several different types, for which I need to calculate the mean, minimum and maximum of 2 metrics. Is there a faster way to write the function or the loop?
Additionally, would writing virtually the same code in Python give a better result? I'm considering rewriting it, but it would take a long time and I'm not sure it would solve the issue.
Thanks
PS: Another approach I tried was looping through alerts and assigning each of them a tripId from consEx, and then aggregating the resulting table, but this seems to be even slower.
Here's my attempt. Let me know if you think it might be right, or if it's fast enough.
EDIT: now that table.express v0.2.0 is out, you could express the calculation as follows:
# assumes the table.express package is installed; the %>% pipe comes from magrittr
library(table.express)
library(magrittr)

types <- alertsEx[, unique(type)]

aggregated <- consEx %>%
  start_expr %>%
  right_join(alertsEx, carId, begin <= timestamp, end >= timestamp) %>%
  select(carId, tripId, type, additionalInfo1) %>%
  chain %>%
  group_by(carId, tripId, type) %>%
  transmute(typeCount = .N, typeMean = mean(additionalInfo1)) %>%
  group_by(carId, tripId) %>%
  mutate(totals = sum(typeCount)) %>%
  end_expr

wide <- data.table::dcast(
  aggregated,
  ... ~ type,
  value.var = c("typeCount", "typeMean"),
  sep = "",
  fill = 0
)

# as.character removes dimensions
sd_cols <- as.character(outer(types, c("typeCount", "typeMean"), function(a, b) { paste0(b, a) }))
names(sd_cols) <- as.character(outer(types, c("Count", "Mean"), paste0))
sd_cols <- c(sd_cols, totals = "totals")

consEx %>%
  start_expr %>%
  mutate_join(wide, carId, tripId, .SDcols = sd_cols) %>%
  end_expr %>%
  setnafill(fill = 0) %>%
  setcolorder(c("carId", "tripId", "begin", "end"))
Original answer:
library(data.table)
set.seed(322902L)
consEx = data.table(
begin = as.POSIXct(c("2019-04-01 00:00:10"," 2019-04-07 10:00:00","2019-04-10 23:00:00","2019-04-12 20:00:00","2019-04-15 10:00:00",
"2019-04-20 10:00:00","2019-04-22 13:30:00","2019-04-10 15:30:00","2019-04-12 21:30:00","2019-04-15 20:00:00")),
end = as.POSIXct(c("2019-04-01 20:00:00","2019-04-07 15:00:00","2019-04-11 10:00:00", "2019-04-12 23:30:00","2019-04-15 15:00:00",
"2019-04-21 12:00:00","2019-04-22 17:30:00","2019-04-10 20:00:00","2019-04-13 05:00:00", "2019-04-15 12:30:00")),
carId = c(1,1,1,2,2,3,3,4,4,5),
tripId = c(1:10)
)
alertsEx = data.table(
timestamp = as.POSIXct(c("2019-04-01 10:00:00","2019-04-01 10:30:00","2019-04-01 15:00:00","2019-04-15 13:00:00","2019-04-22 14:00:00",
"2019-04-22 15:10:00","2019-04-22 15:40:00","2019-04-10 16:00:00","2019-04-10 17:00:00","2019-04-13 04:00:00")),
type = c("T1","T2","T1",'T3',"T1","T1","T3","T2","T2","T1"),
carId = c(1,1,1,2,3,3,3,4,4,4),
additionalInfo1 = rnorm(10,mean=10,sd=4)
)
types <- unique(alertsEx$type)
joined <- consEx[alertsEx,
.(carId, tripId, type, additionalInfo1),
on = .(carId, begin <= timestamp, end >= timestamp)]
aggregated <- joined[, .(typeCount = .N, typeMean = mean(additionalInfo1)), by = .(carId, tripId, type)]
totals <- aggregated[, .(totals = sum(typeCount)), by = .(carId, tripId)]
wide <- dcast(aggregated, carId + tripId ~ type, value.var = c("typeCount", "typeMean"), sep = "", fill = 0)
replaceNA <- function(x) { replace(x, is.na(x), 0) }
consEx[, `:=`(as.character(outer(types, c("Count", "Mean"), paste0)),
lapply(wide[consEx,
as.character(outer(types, c("typeCount", "typeMean"),
function(a, b) { paste0(b, a) })),
with = FALSE,
on = .(carId, tripId)],
replaceNA))]
consEx[, totals := lapply(totals[consEx, x.totals, on = .(carId, tripId)], replaceNA)]
setcolorder(consEx, c("carId", "tripId", "begin", "end"))
print(consEx)
carId tripId begin end T1Count T2Count T3Count T1Mean T2Mean T3Mean totals
1: 1 1 2019-04-01 00:00:10 2019-04-01 20:00:00 2 1 0 10.458406 9.170923 0.000000 3
2: 1 2 2019-04-07 10:00:00 2019-04-07 15:00:00 0 0 0 0.000000 0.000000 0.000000 0
3: 1 3 2019-04-10 23:00:00 2019-04-11 10:00:00 0 0 0 0.000000 0.000000 0.000000 0
4: 2 4 2019-04-12 20:00:00 2019-04-12 23:30:00 0 0 0 0.000000 0.000000 0.000000 0
5: 2 5 2019-04-15 10:00:00 2019-04-15 15:00:00 0 0 1 0.000000 0.000000 12.694035 1
6: 3 6 2019-04-20 10:00:00 2019-04-21 12:00:00 0 0 0 0.000000 0.000000 0.000000 0
7: 3 7 2019-04-22 13:30:00 2019-04-22 17:30:00 2 0 1 10.998666 0.000000 6.812388 3
8: 4 8 2019-04-10 15:30:00 2019-04-10 20:00:00 0 2 0 0.000000 11.600738 0.000000 2
9: 4 9 2019-04-12 21:30:00 2019-04-13 05:00:00 1 0 0 6.369317 0.000000 0.000000 1
10: 5 10 2019-04-15 20:00:00 2019-04-15 12:30:00 0 0 0 0.000000 0.000000 0.000000 0
This answer suggests two data.table approaches which both use non-equi joins for matching alerts to trips and dcast() for reshaping, but which differ in the way the aggregates are combined with consEx.
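Before going into the two variants, here is a toy sketch of a non-equi join on made-up data (not the OP's), for readers unfamiliar with the mechanism:
library(data.table)
periods <- data.table(tripId = 1:2,
                      begin  = as.POSIXct(c("2019-04-01 00:00:00", "2019-04-02 00:00:00")),
                      end    = as.POSIXct(c("2019-04-01 12:00:00", "2019-04-02 12:00:00")))
events  <- data.table(timestamp = as.POSIXct(c("2019-04-01 06:00:00", "2019-04-02 18:00:00")))
# periods[events, ...] keeps one row per event; the on clause matches an event to any
# period whose interval contains its timestamp (columns are NA where nothing matches)
periods[events, on = .(begin <= timestamp, end >= timestamp),
        .(timestamp = i.timestamp, tripId = x.tripId)]
             timestamp tripId
1: 2019-04-01 06:00:00      1
2: 2019-04-02 18:00:00     NA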
The first variant creates a new result dataset, leaving consEx unchanged.
The second variant modifies consEx in place. It is similar to Alexis' solution but more concise, and it uses the new setnafill() function (available with the development version 1.12.3 of data.table).
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
.(tripId, type, additionalInfo1)][
, .(Count = .N, Mean = mean(additionalInfo1)),
by = .(tripId, type)][
, totals := sum(Count), by = tripId][]
result <- dcast(agg[consEx, on = "tripId"], ... ~ type,
value.var = c("Count", "Mean"), fill = 0)[
, c("Count_NA", "Mean_NA") := NULL][
is.na(totals), totals := 0]
setcolorder(result, names(consEx))
result
                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000
Note that the means are likely to differ between the different answers posted, as additionalInfo1 is created by rnorm(10, mean = 10, sd = 4) without a call to set.seed() beforehand. Calling set.seed() creates reproducible random numbers.
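A minimal illustration (the exact values depend on your RNG, but the two calls return identical numbers because the seed is reset):
set.seed(322902L)
rnorm(3, mean = 10, sd = 4)
set.seed(322902L)
rnorm(3, mean = 10, sd = 4)   # same three values as above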
The first part finds the matching trips for each alert by a non-equi join and computes the aggregates by tripId and type as well as the total count for each trip:
agg
   tripId type Count      Mean totals
1:      1   T1     2 12.654609      3
2:      1   T2     1 12.375862      3
3:      5   T3     1  9.316815      1
4:      7   T1     2  8.586061      3
5:      7   T3     1 11.498512      3
6:      8   T2     2  8.696356      2
7:      9   T1     1  9.343681      1
Note that tripId is assumed to be a unique key in consEx.
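One way to verify that assumption (my addition, not part of the original answer):
# TRUE if every tripId occurs exactly once in consEx
consEx[, uniqueN(tripId) == .N]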
In the next step, the aggregates are right joined to consEx:
agg[consEx, on = "tripId"]
    tripId type Count      Mean totals               begin                 end carId
 1:      1   T1     2 12.654609      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 2:      1   T2     1 12.375862      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 3:      2 <NA>    NA        NA     NA 2019-04-07 10:00:00 2019-04-07 15:00:00     1
 4:      3 <NA>    NA        NA     NA 2019-04-10 23:00:00 2019-04-11 10:00:00     1
 5:      4 <NA>    NA        NA     NA 2019-04-12 20:00:00 2019-04-12 23:30:00     2
 6:      5   T3     1  9.316815      1 2019-04-15 10:00:00 2019-04-15 15:00:00     2
 7:      6 <NA>    NA        NA     NA 2019-04-20 10:00:00 2019-04-21 12:00:00     3
 8:      7   T1     2  8.586061      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
 9:      7   T3     1 11.498512      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
10:      8   T2     2  8.696356      2 2019-04-10 15:30:00 2019-04-10 20:00:00     4
11:      9   T1     1  9.343681      1 2019-04-12 21:30:00 2019-04-13 05:00:00     4
12:     10 <NA>    NA        NA     NA 2019-04-15 20:00:00 2019-04-15 12:30:00     5
The output is immediately reshaped from long to wide format using dcast().
Finally, the result is cleaned up by removing superfluous columns, replacing NA by 0, and changing the column order.
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
.(tripId, type, additionalInfo1)][
, .(Count = .N, Mean = mean(additionalInfo1)),
by = .(tripId, type)][
, totals := sum(Count), by = tripId][]
wide <- dcast(agg, ... ~ type, value.var = c("Count", "Mean"), fill = 0)
consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
setnafill(consEx, fill = 0) # data.table version 1.12.3+
consEx
                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000
The first part is the same as in variant 1.
In the next step, the aggregates are reshaped from long to wide format.
Then, an update join is performed where consEx is updated in place by the matching rows of wide. The expression
(names(wide)) := mget(paste0("i.", names(wide)))
is a shortcut for
`:=`(totals = i.totals, Count_T1 = i.Count_T1, ..., Mean_T3 = i.Mean_T3)
where the i. prefix refers to columns of wide and is used to avoid ambiguities.
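A toy example of this idiom on made-up data (not the OP's):
x <- data.table(id = 1:3, val = c(0, 0, 0))
y <- data.table(id = c(1L, 3L), val = c(10, 30))
# for ids present in y, overwrite x$val by reference with y's value (i.val);
# rows of x without a match (id = 2) are left untouched
x[y, on = "id", val := i.val]
x
   id val
1:  1  10
2:  2   0
3:  3  30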
In rows of consEx where there is no match, the appended columns contain NA. Finally, setnafill() is used to replace all NA values by 0.
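A quick sketch of setnafill() on made-up data (requires a data.table version that provides it):
dt <- data.table(a = c(1, NA, 3), b = c(NA, 2, NA))
setnafill(dt, fill = 0)   # replaces NA in numeric columns by reference
dt
   a b
1: 1 0
2: 0 2
3: 3 0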
At the time of writing, 6 different solutions were posted:
- the OP's looping approach in R (fino in the benchmark below),
- an approach using dplyr and tidyr (dave),
- a python approach,
- a data.table approach (alexis),
- and two data.table variants by myself (uwe1 and uwe2).
The 5 R solutions are compared using press() and mark() from the bench package.
The sample datasets are parameterized to be varied in number of rows. As consEx is modified by some solutions, each benchmark run starts with a fresh copy.
Some minor changes were required to make the code work with bench. Most remarkably, the shortcut formula ... ~ type in dcast() caused an error when called within bench::mark() and had to be replaced by a version where all variables of the LHS are explicitly named.
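Concretely, only the left-hand side of the formula changes; the explicit form below is the one used in the uwe1 branch of the benchmark:
# shortcut form that errored inside bench::mark():
#   dcast(agg[consEx, on = "tripId"], ... ~ type, value.var = c("Count", "Mean"), fill = 0)
# explicit form with all LHS variables named:
dcast(agg[consEx, on = "tripId"],
      tripId + carId + begin + end + totals ~ type,
      value.var = c("Count", "Mean"), fill = 0)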
As the OP has pointed out that his version is rather slow, the first benchmark run only uses 1000 rows of consEx and 50, 200, and 1000 rows of alertsEx:
library(magrittr)  # provides %>%, used in the data generation and in Dave2e's solution
bm <- bench::press(
n_trips = 1E3,
n_alerts = 1E3/c(20, 5, 1),
{
n_cars <- 5
n_periods <- round(n_trips / n_cars)
n_types = 3
types = sprintf("T%02i", 1:n_types)
consEx0 <- data.table(
tripId = 1:n_trips,
carId = rep(1:n_cars, each = n_periods),
begin = seq.POSIXt(as.POSIXct("2000-01-01"), length.out = n_periods, by = "14 days") %>%
rep(n_cars) %>%
`mode<-`("double")
)[, end := begin + 7*24*60*60]
set.seed(1)
idx <- sample(n_trips, n_alerts, TRUE)
alertsEx <- consEx0[
idx,
.(timestamp = begin + runif(n_alerts, 1, 7*24*60*60 - 1),
type = sample(types, n_alerts, TRUE),
carId,
additionalInfo1 = rnorm(n_alerts, mean = 10, sd = 4))]
bench::mark(
fino = {
consEx <- copy(consEx0)
findAlerts = function(begin,end,carId_2){
saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2,]
totals = nrow(saida)
saida = split(saida,by="type")
resultsList = list(
"totals" = 0,
"t1Count" = 0,
"t1Mean" = 0,
"t2Count" = 0,
"t2Mean" = 0,
"t3Count" = 0,
"t3Mean" = 0)
resultsList[["totals"]] = totals
types = names(saida)
if("T1" %in% types){
resultsList[["t1Count"]] = nrow(saida[["T1"]])
resultsList[["t1Mean"]] = mean(saida[["T1"]]$additionalInfo1)
}
if("T2" %in% types){
resultsList[["t2Count"]] = nrow(saida[["T2"]])
resultsList[["t2Mean"]] = mean(saida[["T2"]]$additionalInfo1)
}
if("T3" %in% types){
resultsList[["t3Count"]] = nrow(saida[["T3"]])
resultsList[["t3Mean"]] = mean(saida[["T3"]]$additionalInfo1)
}
return(resultsList)
}
for(i in 1:nrow(consEx)){
aux = findAlerts(consEx$begin[i],consEx$end[i],consEx$carId[i])
consEx[i,names(aux) := aux]
}
consEx[]
},
dave = {
# library(dplyr)
# library(tidyr)
consEx <- copy(consEx0)
#split the consEx down to separate carId
splitcons <- split(consEx, consEx$carId)
alertsEx$tripId <- NA
#loop and only search the cons that match the alerts
alertsEx$tripId <- apply(alertsEx, 1, function(a) {
#retrieve the list associated with the correct car
cons <- splitcons[[a[["carId"]]]]
alerttime <- as.POSIXct(a[["timestamp"]])
#find the trip which contains the alerttime
tripId <-
which((alerttime >= cons$begin) & (alerttime <= cons$end))
#identify and return the tripId
cons$tripId[tripId]
})
#Referenced:
#https://stackoverflow.com/questions/30592094/r-spreading-multiple-columns-with-tidyr
#https://stackoverflow.com/questions/35321497/spread-multiple-columns-tidyr
alertsEx2 <-
alertsEx %>%
dplyr::group_by(carId, tripId, type) %>%
dplyr::summarize(mean = mean(additionalInfo1), count = dplyr::n())
alerttable <-
tidyr::gather(alertsEx2, variable, val, -(carId:type), na.rm = TRUE) %>%
tidyr::unite(temp, type, variable) %>%
tidyr::spread(temp, val, fill = 0)
consEx %>%
dplyr::left_join(alerttable, by = c("tripId", "carId"))
},
alexis = {
consEx <- copy(consEx0)
joined <- consEx[alertsEx,
.(carId, tripId, type, additionalInfo1),
on = .(carId, begin <= timestamp, end >= timestamp)]
aggregated <- joined[, .(typeCount = .N, typeMean = mean(additionalInfo1)), by = .(carId, tripId, type)]
totals <- aggregated[, .(totals = sum(typeCount)), by = .(carId, tripId)]
long <- dcast(aggregated, carId + tripId ~ type, value.var = c("typeCount", "typeMean"), sep = "", fill = 0)
replaceNA <- function(x) { replace(x, is.na(x), 0) }
consEx[, `:=`(as.character(outer(types, c("Count", "Mean"), paste0)),
lapply(long[consEx,
as.character(outer(types, c("typeCount", "typeMean"),
function(a, b) { paste0(b, a) })),
with = FALSE,
on = .(carId, tripId)],
replaceNA))]
consEx[, totals := lapply(totals[consEx, x.totals, on = .(carId, tripId)], replaceNA)]
setcolorder(consEx, c("carId", "tripId", "begin", "end"))
consEx
},
uwe1 = {
consEx <- copy(consEx0)
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
.(tripId, type, additionalInfo1)][
, .(Count = .N, Mean = mean(additionalInfo1)),
by = .(tripId, type)][
, totals := sum(Count), by = tripId][]
result <- dcast(agg[consEx, on = "tripId"],
tripId + carId + begin+ end + totals ~ type,
value.var = c("Count", "Mean"), fill = 0)[
, c("Count_NA", "Mean_NA") := NULL][
is.na(totals), totals := 0][]
# setcolorder(result, names(consEx))
result
},
uwe2 = {
consEx <- copy(consEx0)
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
.(tripId, type, additionalInfo1)][
, .(Count = .N, Mean = mean(additionalInfo1)),
by = .(tripId, type)][
, totals := sum(Count), by = tripId][]
wide <- dcast(agg, tripId + totals ~ type, value.var = c("Count", "Mean"), fill = 0)
consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
setnafill(consEx, fill = 0) # data.table version 1.12.3+
consEx
},
check = FALSE
)
}
)
library(ggplot2)
autoplot(bm)
Please note the logarithmic time scale.
The chart confirms the OP's assumption that his approach is orders of magnitude slower than any of the other solutions. It took 3 seconds for 1000 rows, while the OP's production dataset has 12 M rows. The run time of Dave2e's approach increases with the number of alerts.
Also, the OP's approach is the most demanding in terms of allocated memory. It allocates up to 190 MBytes, while the data.table versions only require about 1 MByte.
bm
# A tibble: 15 x 15
   expression n_trips n_alerts      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time
   <bch:expr>   <dbl>    <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
 1 fino          1000       50    2.95s    2.95s     0.339  149.25MB     3.39     1    10      2.95s <data~ <Rpro~ <bch~
 2 dave          1000       50  21.52ms  22.61ms    41.6    779.46KB     5.94    21     3   505.01ms <data~ <Rpro~ <bch~
 3 alexis        1000       50  21.98ms  23.09ms    41.2   1007.84KB     3.93    21     2    509.5ms <data~ <Rpro~ <bch~
 4 uwe1          1000       50  14.47ms  15.45ms    62.3    919.12KB     1.95    32     1   513.94ms <data~ <Rpro~ <bch~
 5 uwe2          1000       50  14.14ms  14.73ms    64.2     633.2KB     1.95    33     1   513.91ms <data~ <Rpro~ <bch~
 6 fino          1000      200    3.09s    3.09s     0.323  155.12MB     4.53     1    14      3.09s <data~ <Rpro~ <bch~
 7 dave          1000      200   53.4ms  62.11ms    11.0       1.7MB     5.49     8     4   728.02ms <data~ <Rpro~ <bch~
 8 alexis        1000      200  22.42ms  23.34ms    41.5      1.03MB     3.77    22     2   530.25ms <data~ <Rpro~ <bch~
 9 uwe1          1000      200  14.72ms  15.76ms    62.2    953.07KB     0       32     0   514.46ms <data~ <Rpro~ <bch~
10 uwe2          1000      200  14.49ms  15.17ms    63.6    695.63KB     1.99    32     1    503.4ms <data~ <Rpro~ <bch~
11 fino          1000     1000     3.6s     3.6s     0.278  187.32MB     3.61     1    13       3.6s <data~ <Rpro~ <bch~
12 dave          1000     1000 242.46ms  243.7ms     4.07      6.5MB     6.78     3     5   737.06ms <data~ <Rpro~ <bch~
13 alexis        1000     1000  24.32ms  25.78ms    37.6      1.21MB     3.95    19     2   505.84ms <data~ <Rpro~ <bch~
14 uwe1          1000     1000  16.04ms   16.8ms    56.8      1.05MB     1.96    29     1   510.23ms <data~ <Rpro~ <bch~
15 uwe2          1000     1000  15.69ms  16.41ms    54.8    938.74KB     3.92    28     2   510.63ms <data~ <Rpro~ <bch~
So, Fino's approach is omitted from the second benchmark run with larger problem sizes of 10k and 100k rows for consEx and 2k and 20k rows for alertsEx, respectively.
As Dave2e's approach is about 100 times slower than the fastest approach for 20k rows, it is omitted from the third run. The third run simulates the OP's production dataset of 12 M rows in consEx and 2 M rows in alertsEx:
print(bm, n = Inf)
# A tibble: 27 x 15
   expression n_trips n_alerts      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time
   <bch:expr>   <dbl>    <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
 1 alexis      1.00e5    20000 747.87ms 747.87ms   1.34       38.18MB  6.69       1     5   747.87ms <data~ <Rpro~ <bch~
 2 uwe1        1.00e5    20000 124.52ms 138.89ms   7.32       37.88MB  3.66       4     2   546.17ms <data~ <Rpro~ <bch~
 3 uwe2        1.00e5    20000  72.29ms  73.76ms  11.3        16.63MB  1.88       6     1   533.26ms <data~ <Rpro~ <bch~
 4 alexis      1.00e6    20000    7.51s    7.51s   0.133     335.44MB  3.33       1    25      7.51s <data~ <Rpro~ <bch~
 5 uwe1        1.00e6    20000 995.31ms 995.31ms   1.00      346.94MB  2.01       1     2   995.31ms <data~ <Rpro~ <bch~
 6 uwe2        1.00e6    20000 714.19ms 714.19ms   1.40       89.27MB  1.40       1     1   714.19ms <data~ <Rpro~ <bch~
 7 alexis      1.00e7    20000    1.67m    1.67m   0.00999     3.21GB  0.340      1    34      1.67m <data~ <Rpro~ <bch~
 8 uwe1        1.00e7    20000    3.44m    3.44m   0.00485     3.41GB  0.00969    1     2      3.44m <data~ <Rpro~ <bch~
 9 uwe2        1.00e7    20000   42.51s   42.51s   0.0235     810.3MB  0          1     0     42.51s <data~ <Rpro~ <bch~
10 alexis      1.00e5   200000   15.38s   15.38s   0.0650     73.22MB  0.0650     1     1     15.38s <data~ <Rpro~ <bch~
11 uwe1        1.00e5   200000    1.34s    1.34s   0.747      63.81MB  0          1     0      1.34s <data~ <Rpro~ <bch~
12 uwe2        1.00e5   200000    1.47s    1.47s   0.681      58.98MB  0          1     0      1.47s <data~ <Rpro~ <bch~
13 alexis      1.00e6   200000    2.91m    2.91m   0.00573   375.93MB  0.0115     1     2      2.91m <data~ <Rpro~ <bch~
14 uwe1        1.00e6   200000    9.72s    9.72s   0.103     371.69MB  0          1     0      9.72s <data~ <Rpro~ <bch~
15 uwe2        1.00e6   200000 888.67ms 888.67ms   1.13      161.82MB  0          1     0   888.67ms <data~ <Rpro~ <bch~
16 alexis      1.00e7   200000    6.29m    6.29m   0.00265     3.15GB  0.0928     1    35      6.29m <data~ <Rpro~ <bch~
17 uwe1        1.00e7   200000    2.45m    2.45m   0.00681     3.43GB  0.0136     1     2      2.45m <data~ <Rpro~ <bch~
18 uwe2        1.00e7   200000   12.48m   12.48m   0.00134   887.99MB  0.00134    1     1     12.48m <data~ <Rpro~ <bch~
19 alexis      1.00e5  2000000    3.04s    3.04s   0.329        207MB  0          1     0      3.04s <data~ <Rpro~ <bch~
20 uwe1        1.00e5  2000000    2.96s    2.96s   0.338     196.42MB  0          1     0      2.96s <data~ <Rpro~ <bch~
21 uwe2        1.00e5  2000000    2.81s    2.81s   0.355     187.79MB  0          1     0      2.81s <data~ <Rpro~ <bch~
22 alexis      1.00e6  2000000    6.96m    6.96m   0.00239   726.14MB  0.00479    1     2      6.96m <data~ <Rpro~ <bch~
23 uwe1        1.00e6  2000000    2.01m    2.01m   0.00827    631.1MB  0          1     0      2.01m <data~ <Rpro~ <bch~
24 uwe2        1.00e6  2000000   30.54s   30.54s   0.0327    584.81MB  0          1     0     30.54s <data~ <Rpro~ <bch~
25 alexis      1.00e7  2000000   31.54m   31.54m   0.000528    3.66GB  0.0127     1    24     31.54m <data~ <Rpro~ <bch~
26 uwe1        1.00e7  2000000    8.72m    8.72m   0.00191     3.67GB  0          1     0      8.72m <data~ <Rpro~ <bch~
27 uwe2        1.00e7  2000000   12.35m   12.35m   0.00135     1.58GB  0          1     0     12.35m <data~ <Rpro~ <bch~
Caveat: please note that the run times for most of the parameter variations are confirmed by only one measurement. In addition, garbage collection (gc) may have an influence due to the limited amount of RAM on my computer (8 GBytes). So, your mileage may vary.
The timings are not entirely consistent. In general, uwe2 (in-place update) is substantially faster than uwe1 (new result set), which confirms jangorecki's assumption, and alexis is almost always the slowest. (Please note the logarithmic time scale in the chart, or compare the timings in the table.)
However, some parameter variations deviate from this pattern: for 10 M trips, uwe1 is faster than uwe2 in two cases, and alexis is also catching up. These effects might be due to garbage collection.
Memory allocation is much lower for the in-place update approach (even if this does not always pay off in terms of speed).
Again, interpret the results with care. The third benchmark run should be repeated on a larger machine and with more patience to allow for repeated measurements.