I’ve devised a solution to look up values from multiple columns of two separate data tables and add a new column based on calculations over those values (multiple conditional comparisons). Code below. It involves using a data.table join while computing values from both tables; however, the tables aren’t joined on the columns I’m comparing, so I suspect I may not be getting the speed advantages inherent to data.tables that I’ve read so much about and am excited about tapping into. Said another way, I’m joining on a ‘dummy’ column, so I don’t think I’m joining “properly.”
The exercise is, given an X by X grid dtGrid and a list of X^2 random Events dtEvents within the grid, to determine how many Events occur within a 1 unit radius of each grid point. The code is below. I picked a grid size of 100 X 100, which takes ~1.5 sec to run the join on my machine, but I can’t go much bigger without introducing an enormous performance hit (200 X 200 takes ~22 sec).
I really like the flexibility of being able to add multiple conditions to my val statement (e.g., if I wanted to add a bunch of AND and OR combinations I could do that), so I'd like to retain that functionality. (I've put a made-up example of what I mean right after the code below.)
Is there a way to use data.table joins ‘properly’ (or any other data.table solution) to achieve a much speedier / more efficient outcome?
Thanks so much!
#Initialization stuff
library(data.table)
set.seed(77L)
#Set grid size constant
#Increasing this number to a value much larger than 100 will result in significantly longer run times
cstGridSize = 100L
#Create Grid
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, 'Var1', 'x')
setnames(dtGrid, 'Var2', 'y')
dtGrid[, DummyJoin:='A']
setkey(dtGrid, DummyJoin)
#Create Events
xrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
yrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
dtEvents <- data.table(x=xrand, y=yrand)
dtEvents[, DummyJoin:='A']
dtEvents[, Counter:=1L]
setkey(dtEvents, DummyJoin)
#Return # of events within 1 unit radius of each grid point
system.time(
  dtEventsWithinRadius <- dtEvents[dtGrid, {
    val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];  #basic circle formula: x^2 + y^2 = radius^2
    list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
  }, by=.EACHI]
)
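To show the kind of flexibility I mean, here is a purely hypothetical variant of the join above (the extra conditions and the name dtExample are made up, just for illustration): count only the events within the radius that also lie above or to the right of the grid point.

#Hypothetical variant: val combines the circle test with extra AND/OR conditions
dtExample <- dtEvents[dtGrid, {
  val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2 &
                (x > i.x | y > i.y)];  #within radius AND above-or-right of point
  list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
}, by=.EACHI]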
Very interesting question... and great use of by = .EACHI! Here's another approach using the new non-equi joins from the current development version, v1.9.7.
Issue: Your use of by=.EACHI is completely justified, because the other alternative is to perform a cross join (each row of dtGrid joined to all rows of dtEvents), but that's too exhaustive and is bound to explode very quickly.
However, by=.EACHI is performed along with an equi-join on a dummy column, which means all distances still get computed (just one grid point at a time, so it is memory efficient). That is, in your code, for each row of dtGrid, the distance to every row of dtEvents is still computed; hence it doesn't scale as well as expected.
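To put a rough number on that (my own back-of-the-envelope, using the 100 x 100 setup from the question):

#Under the dummy-column equi-join, every grid point is compared with every event,
#so the number of distance evaluations is:
nrow(dtGrid) * nrow(dtEvents)
# [1] 102010000   # 101^2 grid points * 100^2 events, ~102 million checks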
Strategy: Then you'd agree that an acceptable improvement is to restrict the number of rows that would result from joining each row of dtGrid to dtEvents.
Let (x_i, y_i) come from dtGrid and (a_j, b_j) come from dtEvents, say, where 1 <= i <= nrow(dtGrid) and 1 <= j <= nrow(dtEvents). Then, for i = 1, we need to extract all j satisfying (x1 - a_j)^2 + (y1 - b_j)^2 < 1. That can only happen when:
(x1 - a_j)^2 < 1 AND (y1 - b_j)^2 < 1
This helps reduce the search space drastically because, instead of looking at all rows in dtEvents for each row in dtGrid, we just have to extract the rows where:
a_j - 1 <= x1 <= a_j + 1 AND b_j - 1 <= y1 <= b_j + 1
# where '1' is the radius
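To make that concrete, here is a sketch of my own (for a single grid point, not the final solution): the bounding box keeps only nearby events, and the exact circle test then runs on that small candidate set.

p <- dtGrid[1L]                                   #one grid point (x1, y1)
cand <- dtEvents[x >= p$x - 1 & x <= p$x + 1 &
                 y >= p$y - 1 & y <= p$y + 1]     #box pre-filter
cand[, sum((x - p$x)^2 + (y - p$y)^2 < 1)]        #exact test on the few candidates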
This bounding-box constraint can be directly translated to a non-equi join, and combined with by=.EACHI as before. The only additional step required is to construct the columns a_j-1, a_j+1, b_j-1, b_j+1, as follows:
foo1 <- function(dt1, dt2) {
  dt2[, `:=`(xm = x - 1, xp = x + 1, ym = y - 1, yp = y + 1)]      ## (1)
  tmp = dt2[dt1, on = .(xm <= x, xp >= x, ym <= y, yp >= y),
            .(sum((i.x - x)^2 + (i.y - y)^2 < 1)), by = .EACHI,
            allow = TRUE, nomatch = 0L
           ][, c("xp", "yp") := NULL]                              ## (2)
  tmp[]
}
## (1) constructs all the columns necessary for the non-equi join (since expressions are not yet allowed in on=).
## (2) performs the non-equi join, computing the distances and counting those that are < 1, on the restricted set of combinations for each row of dtGrid -- hence it should be much faster.
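As a quick illustration of the output shape (my own addition, not part of the answer itself): the grouping columns xm and ym come back holding the grid coordinates they were matched against, alongside the per-point count V1.

#Peek at the result: xm/ym are the grid point's coordinates (the i-values from
#dtGrid) and V1 is the number of events within radius 1 of that point.
res <- foo1(dtGrid, dtEvents)
head(res)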
Benchmarks:
# Here's your code (modified to ensure identical column names etc.):
foo2 <- function(dt1, dt2) {
  ans = dt2[dt1,
            {
              val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];
              .(xm = i.x, ym = i.y, V1 = sum(val))
            },
            by = .EACHI][, "DummyJoin" := NULL]
  ans[]
}
# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s
# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s
# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s
identical(ans1[V1 != 0L], ans2[V1 != 0L]) # TRUE for all of them
The speedups are ~10x, 32x and 53x respectively.
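(An assumption on my part: the 200 and 300 timings come from re-running the question's setup with a larger cstGridSize before calling foo1/foo2, e.g.:)

#Regenerate the tables at a larger grid size, then re-run the two benchmarks
cstGridSize <- 200L
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, c("Var1", "Var2"), c("x", "y"))
dtGrid[, DummyJoin := 'A']
setkey(dtGrid, DummyJoin)
dtEvents <- data.table(x = runif(cstGridSize^2, 0, cstGridSize + 1),
                       y = runif(cstGridSize^2, 0, cstGridSize + 1),
                       DummyJoin = 'A', Counter = 1L)
setkey(dtEvents, DummyJoin)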
Note that the rows in dtGrid for which the condition is not satisfied for even a single row in dtEvents will not be present in the result (due to nomatch=0L). If you want those rows as well, you'll have to also add one of the xm/xp/ym/yp columns and check them for NA (= no matches).
This is the reason we had to remove all 0 counts to get identical = TRUE.
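If you do want every grid point in the output, one simple alternative (a sketch of my own, not from the answer above) is to left-join the counts back onto dtGrid and fill the missing counts with zero:

#Sketch: recover the grid points with zero events by joining the counts back
#onto dtGrid (xm/ym hold the grid coordinates) and replacing NA counts with 0.
ans1 <- foo1(dtGrid, dtEvents)
full <- ans1[dtGrid, on = .(xm = x, ym = y)]   #left join: every grid point kept
full[is.na(V1), V1 := 0L]
full[, DummyJoin := NULL][]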
HTH
PS: See history for another variation where the entire join is materialised and then the distance is computed and counts generated.