I am stuck on a difficult problem in R and am not able to resolve it. The problem goes like this.
x and y are two vectors, as given below:
x<- c(1,2,3,4,5)
y<- c(12,4,2,5,7,18,9,10)
I want to create a new vector p, where length(p) = length(x), in the following manner: for each element of x, p should contain the index of the element of y that is closest to it, with the restriction that an index of y which has already been used cannot be chosen again. So p[1] = 3, since y[3] = 2 is the value closest to x[1] = 1; the next values of p are 2 and 4, obtained by repeating the same step while skipping indices that have already been taken. In summary, while calculating the minimum distance between x and y, for each element of x we have to pick an index of y that has not previously appeared, so all the elements of p have to be unique.
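To make the rule concrete, here is what it gives on the example vectors:
x <- c(1, 2, 3, 4, 5)
y <- c(12, 4, 2, 5, 7, 18, 9, 10)
# x[1]=1 -> y[3]=2,  x[2]=2 -> y[2]=4,  x[3]=3 -> y[4]=5,
# x[4]=4 -> y[5]=7,  x[5]=5 -> y[7]=9
p <- c(3, 2, 4, 5, 7)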
Any answers would be appreciated.
I tried something like this, though not a complete solution:
minID <- function(x,y) {return(which(abs(x-y)==min(abs(x-y))))};
p1 <- sapply(x,minID,y=y);
# Calculates all minimum-distance indices of y for each element of x, nowhere close to the actual solution :(
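For the example vectors this gives the list 3, 3, c(2, 3), 2, 4: index 3 is picked for both x[1] and x[2], and the tie for x[3] returns two indices, so the uniqueness requirement is not met.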
My actual x and y have over 1 million elements each, so a for loop would be extremely slow. I am looking for a faster solution.
This can be implemented efficiently with a binary search tree on the elements of y, deleting elements as they're matched and added to p. I've implemented this using std::set from the C++ STL, using Rcpp to get the code into R:
library(Rcpp)
getVals = cppFunction(
'NumericVector getVals(NumericVector x, NumericVector y) {
  // Assumes length(y) >= length(x), so an unused element is always available.
  NumericVector p(x.size());

  // Build a set of (value, original index) pairs; the set is ordered by value,
  // so it acts as a balanced binary search tree over the elements of y.
  std::vector<std::pair<double, int> > init;
  for (int j = 0; j < y.size(); ++j) {
    init.push_back(std::pair<double, int>(y[j], j));
  }
  std::set<std::pair<double, int> > s(init.begin(), init.end());

  for (int i = 0; i < x.size(); ++i) {
    std::set<std::pair<double, int> >::iterator p1, p2, selected;
    // First remaining element of y with value >= x[i]
    p1 = s.lower_bound(std::pair<double, int>(x[i], 0));
    if (p1 == s.end()) {
      // Every remaining value is below x[i]; take the largest one
      selected = --p1;
    } else if (p1 == s.begin()) {
      // No remaining value is below x[i]; take the smallest one
      selected = p1;
    } else {
      // Otherwise compare the nearest remaining values above (p1) and below (p2) x[i]
      p2 = p1;
      --p2;
      if (fabs(x[i] - p1->first) < fabs(x[i] - p2->first)) {
        selected = p1;
      } else {
        selected = p2;
      }
    }
    p[i] = selected->second + 1;  // convert to 1-based R index
    s.erase(selected);            // this index of y can no longer be chosen
  }
  return p;
}')
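As a quick sanity check, running this on the small example from the question should reproduce the result worked out there:
x <- c(1, 2, 3, 4, 5)
y <- c(12, 4, 2, 5, 7, 18, 9, 10)
getVals(x, y)
# expected: 3 2 4 5 7 (the nearest unused index of y for each element of x)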
Here's a runtime comparison against the pure-R solution that was posted. The binary search tree approach is much faster and can handle vectors of length 1 million in just a few seconds:
# Pure-R posted solution
getVals2 = function(x, y) {
  n <- length(x)
  p <- rep(NA, n)
  for (i in 1:n) {
    id <- which.min(abs(y - x[i]))  # index of the nearest remaining element of y
    y[id] <- Inf                    # mark it as used so it cannot be picked again
    p[i] <- id
  }
  return(p)
}
# Test with medium-sized vectors
set.seed(144)
x = rnorm(10000)
y = rnorm(20000)
system.time(res1 <- getVals(x, y))
# user system elapsed
# 0.008 0.000 0.008
system.time(res2 <- getVals2(x, y))
# user system elapsed
# 1.284 2.919 4.211
all.equal(res1, res2)
# [1] TRUE
# Test with large vectors
set.seed(144)
x = rnorm(1000000)
y = rnorm(2000000)
system.time(res3 <- getVals(x, y))
# user system elapsed
# 4.402 0.097 4.467
The reason for the speedup is that this approach is asymptotically faster: if x is of size n and y is of size m, then the binary search tree approach runs in O((n + m) log(m)) time (O(m log(m)) to construct the BST and O(n log(m)) to compute p), while the which.min approach runs in O(nm) time.
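To put rough numbers on that (ignoring constant factors): with n = 10^6 and m = 2 * 10^6 as in the large test above, (n + m) * log2(m) is on the order of 6 * 10^7 operations, whereas n * m is 2 * 10^12, over four orders of magnitude more.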