Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rcpp function inside data.table join

Tags:

r

data.table

rcpp

Question

Why does my Rcpp function inside a data.table join produce a different (& incorrect) result compared to when used outside of a join?

Example

I have two data.tables, and I want to find the Euclidean distance between each pair of coordinates across both tables.

To do the distance calculation I've defined two functions, one in base R, and the other using Rcpp.

library(Rcpp)
library(data.table)

rEucDist <- function(x1, y1, x2, y2) return(sqrt((x2 - x1)^2 + (y2 - y1)^2))

cppFunction('NumericVector cppEucDistance(NumericVector x1, NumericVector y1,
                                          NumericVector x2, NumericVector y2){

            int n = x1.size();
            NumericVector distance(n);

            for(int i = 0; i < n; i++){
              distance[i] = sqrt(pow((x2[i] - x1[i]), 2) + pow((y2[i] - y1[i]), 2));
            }
            return distance;
}')


dt1 <- data.table(id = rep(1, 6),
                  seq1 = 1:6,
                  x = c(1:6),
                  y = c(1:6))

dt2 <- data.table(id = rep(1, 6),
                  seq2 = 7:12,
                  x = c(6:1),
                  y = c(6:1))

When doing a join first, then calculating the distance, both functions produce the same result

dt_cpp <- dt1[ dt2, on = "id", allow.cartesian = T]
dt_cpp[, dist := cppEucDistance(x, y, i.x, i.y)]

dt_r <- dt1[ dt2, on = "id", allow.cartesian = T]
dt_r[, dist := rEucDist(x, y, i.x, i.y)]

all.equal(dt_cpp$dist, dt_r$dist)
# [1] TRUE

However, if I do the calculation within a join the results differ; the cpp version is incorrect.

dt_cppJoin <- dt1[
    dt2,
    { (cppEucDistance(x, y, i.x, i.y)) },
    on = "id",
    by = .EACHI
    ]

dt_rJoin <- dt1[
    dt2,
    { (rEucDist(x, y, i.x, i.y)) },
    on = "id",
    by = .EACHI
    ]

all.equal(dt_cppJoin$V1, dt_rJoin$V1)
# "Mean relative difference: 0.6173913"

## note that the R version of the join is correct
all.equal(dt_r$dist, dt_rJoin$V1)
# [1] TRUE

What is it about the Rcpp implementation that causes the join version to give a different result?

like image 933
SymbolixAU Avatar asked Jun 08 '17 23:06

SymbolixAU


1 Answers

tl;dr

  1. Ultimately, this comes down to my lack of understanding of how EACHI works.

  2. My C++ loop as it stands is relying on x1 and x2 being the same length. This is not the case when EACHI is used, so my vector-subsetting in the C++ function won't be correct. Therefore I'll need a more sophisticated C++ function.


I believe the 'issue' I'm seeing comes down to what EACHI is doing. After re-reading Arun's answer again and again I think I understand it.

In particular, EACHI "evaluates j-expression on the matching rows for each row in Y". So it's taking the Y table (dt2 in my case) and evaluating it each row at a time. This can be seen if you use a few prints inside the Rcpp function

## I'm reducing the data sets so the 'prints' are readible
dt1 <- data.table(id = rep(1, 2),
                  seq1 = 1:2,
                  x = c(1:2),
                  y = c(1:2))

dt2 <- data.table(id = rep(1, 2),
                  seq2 = 7:8,
                  x = c(6:5),
                  y = c(6:5))

cppFunction('NumericVector cppEucDistance(NumericVector x1, NumericVector y1,
                                          NumericVector x2, NumericVector y2){

  int n = x1.size();
  NumericVector distance(n);

  Rcpp::Rcout << "x1 size: " << x1.size() << std::endl;
  Rcpp::Rcout << "x2 size: " << x2.size() << std::endl;
  Rcpp::Rcout << "n size: " << n << std::endl;

  for(int i = 0; i < n; i++){
    distance[i] = sqrt(pow((x2[i] - x1[i]), 2) + pow((y2[i] - y1[i]), 2));
  }
  return distance;
}')

dt_cppJoin <- dt1[
  dt2,
  {print(i.x); (cppEucDistance(x, y, i.x, i.y)) },
  on = "id",
  by = .EACHI
  ]
# [1] 6
# x1 size: 2
# x2 size: 1
# n size: 2
# [1] 5
# x1 size: 2
# x2 size: 1
# n size: 2

Notice in the output of the print statement that x2 is only length 1 each iteration. Whereas n is 2. So my x2[i] will obviously 'fail' when i gets to the second element (remembering that C++ arrays are 0-indexed)

Whereas taking out the EACHI has the same effect as the full-join approach

dt_cppJoin <- dt1[
  dt2,
  {print(i.x); (cppEucDistance(x, y, i.x, i.y)) },
  on = "id"
  ]
# [1] 6 6 5 5
# x1 size: 4
# x2 size: 4
# n size: 4

like image 121
SymbolixAU Avatar answered Oct 22 '22 01:10

SymbolixAU