Why does my Rcpp
function inside a data.table
join produce a different (& incorrect) result compared to when used outside of a join?
I have two data.table
s, and I want to find the Euclidean distance between each pair of coordinates across both tables.
To do the distance calculation I've defined two functions, one in base R, and the other using Rcpp
.
library(Rcpp)
library(data.table)
rEucDist <- function(x1, y1, x2, y2) return(sqrt((x2 - x1)^2 + (y2 - y1)^2))
cppFunction('NumericVector cppEucDistance(NumericVector x1, NumericVector y1,
NumericVector x2, NumericVector y2){
int n = x1.size();
NumericVector distance(n);
for(int i = 0; i < n; i++){
distance[i] = sqrt(pow((x2[i] - x1[i]), 2) + pow((y2[i] - y1[i]), 2));
}
return distance;
}')
dt1 <- data.table(id = rep(1, 6),
seq1 = 1:6,
x = c(1:6),
y = c(1:6))
dt2 <- data.table(id = rep(1, 6),
seq2 = 7:12,
x = c(6:1),
y = c(6:1))
When doing a join first, then calculating the distance, both functions produce the same result
dt_cpp <- dt1[ dt2, on = "id", allow.cartesian = T]
dt_cpp[, dist := cppEucDistance(x, y, i.x, i.y)]
dt_r <- dt1[ dt2, on = "id", allow.cartesian = T]
dt_r[, dist := rEucDist(x, y, i.x, i.y)]
all.equal(dt_cpp$dist, dt_r$dist)
# [1] TRUE
However, if I do the calculation within a join the results differ; the cpp version is incorrect.
dt_cppJoin <- dt1[
dt2,
{ (cppEucDistance(x, y, i.x, i.y)) },
on = "id",
by = .EACHI
]
dt_rJoin <- dt1[
dt2,
{ (rEucDist(x, y, i.x, i.y)) },
on = "id",
by = .EACHI
]
all.equal(dt_cppJoin$V1, dt_rJoin$V1)
# "Mean relative difference: 0.6173913"
## note that the R version of the join is correct
all.equal(dt_r$dist, dt_rJoin$V1)
# [1] TRUE
What is it about the Rcpp
implementation that causes the join version to give a different result?
Ultimately, this comes down to my lack of understanding of how EACHI
works.
My C++ loop as it stands is relying on x1
and x2
being the same length. This is not the case when EACHI
is used, so my vector-subsetting in the C++ function won't be correct. Therefore I'll need a more sophisticated C++ function.
I believe the 'issue' I'm seeing comes down to what EACHI
is doing. After re-reading Arun's answer again and again I think I understand it.
In particular, EACHI
"evaluates j-expression on the matching rows for each row in Y". So it's taking the Y
table (dt2
in my case) and evaluating it each row at a time. This can be seen if you use a few prints
inside the Rcpp function
## I'm reducing the data sets so the 'prints' are readible
dt1 <- data.table(id = rep(1, 2),
seq1 = 1:2,
x = c(1:2),
y = c(1:2))
dt2 <- data.table(id = rep(1, 2),
seq2 = 7:8,
x = c(6:5),
y = c(6:5))
cppFunction('NumericVector cppEucDistance(NumericVector x1, NumericVector y1,
NumericVector x2, NumericVector y2){
int n = x1.size();
NumericVector distance(n);
Rcpp::Rcout << "x1 size: " << x1.size() << std::endl;
Rcpp::Rcout << "x2 size: " << x2.size() << std::endl;
Rcpp::Rcout << "n size: " << n << std::endl;
for(int i = 0; i < n; i++){
distance[i] = sqrt(pow((x2[i] - x1[i]), 2) + pow((y2[i] - y1[i]), 2));
}
return distance;
}')
dt_cppJoin <- dt1[
dt2,
{print(i.x); (cppEucDistance(x, y, i.x, i.y)) },
on = "id",
by = .EACHI
]
# [1] 6
# x1 size: 2
# x2 size: 1
# n size: 2
# [1] 5
# x1 size: 2
# x2 size: 1
# n size: 2
Notice in the output of the print statement that x2
is only length 1 each iteration. Whereas n
is 2. So my x2[i]
will obviously 'fail' when i
gets to the second element (remembering that C++ arrays are 0-indexed)
Whereas taking out the EACHI
has the same effect as the full-join approach
dt_cppJoin <- dt1[
dt2,
{print(i.x); (cppEucDistance(x, y, i.x, i.y)) },
on = "id"
]
# [1] 6 6 5 5
# x1 size: 4
# x2 size: 4
# n size: 4
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With