Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Minimum id with non-repetitive elements

Tags:

r

I am stuck in a difficult problem in R and am not able to resolve it. The problem goes like this.

x and y are two vectors, as given below:

x<- c(1,2,3,4,5)
y<- c(12,4,2,5,7,18,9,10)

I want to create a new vector p, where length(p) = length(x), in the following manner:

  • For each id in x, find the id in y which has minimum absolute distance in terms of values. For instance, for id=1 in x, value_x(id=1)=1, min_value_y =2, and id_y(value==2) = 3. Thus, the answer to id 1 in x is 3. Thus, we create a new vector p, which will have following values: p = (3,3,3,2,4);

Now we have to update p, in the following manner:

  • As 3 has been the id corresponding to id_x=1, it can't be the id for id_x=2. Hence, we have to discard id_y =3 with value 2, to calculate the next minimum distance for id_x=2. Next best minimum distance for id_x=2 is id_y=2 with value 4. Hence, updated p is (3,2,3,2,4).
  • As 3 has been the id corresponding to id_x=1, it can't be the id for id_x=3. Hence, we have to discard id_y =3 with value 2, to calculate the next minimum distance for id_x=3. Next best minimum distance for id_x=3 is 2. Hence, updated p is (3,2,4,2,4).

As next values in p is 2, and 4 we have to repeat what we did in the last two steps. In summary, while calculating the minimum distance between x and y, for each id of x we have to get that id of y which hasn't been previously appeared. Thus all the elements of p has to be unique.

Any answers would be appreciated.

I tried something like this, though not a complete solution:

minID <- function(x,y) {return(which(abs(x-y)==min(abs(x-y))))};
p1 <- sapply(x,minID,y=y);
#Calculates the list of all minimum elements -no where close to actual solution :(

I have a x and y over 1 million, hence for loop would be extremely slow. I am looking for a faster solution.

like image 668
Sumit Avatar asked Dec 15 '22 00:12

Sumit


2 Answers

This can be implemented efficiently with a binary search tree on the elements of y, deleting elements as they're matched and added to p. I've implemented this using set from the stl in C++, using Rcpp to get the code into R:

library(Rcpp)
getVals = cppFunction(
'NumericVector getVals(NumericVector x, NumericVector y) {
    NumericVector p(x.size());
    std::vector<std::pair<double, int> > init;
    for (int j=0; j < y.size(); ++j) {
        init.push_back(std::pair<double, int>(y[j], j));
    }
    std::set<std::pair<double, int> > s(init.begin(), init.end());
    for (int i=0; i < x.size(); ++i) {
        std::set<std::pair<double, int> >::iterator p1, p2, selected;
        p1 = s.lower_bound(std::pair<double, int>(x[i], 0));
        p2 = p1;
        --p2;
        if (p1 == s.end()) {
            selected = p2;
        } else if (p2 == s.begin()) {
            selected = p1;
        } else if (fabs(x[i] - p1->first) < fabs(x[i] - p2->first)) {
            selected = p1;
        } else {
            selected = p2;
        }
        p[i] = selected->second+1;  // 1-indexed
        s.erase(selected);
    }
    return p;
}')

Here's a runtime comparison against the pure-R solution that was posted -- the binary search tree solution is much faster and enables solutions with vectors of length 1 million in just a few seconds:

# Pure-R posted solution
getVals2 = function(x, y) {
  n <- length(x)
  p <- rep(NA, n)
  for(i in 1:n) {
    id <- which.min(abs(y - x[i]))
    y[id] <- Inf
    p[i] <- id
  }
  return(p)
}

# Test with medium-sized vectors
set.seed(144)
x = rnorm(10000)
y = rnorm(20000)
system.time(res1 <- getVals(x, y))
#    user  system elapsed 
#   0.008   0.000   0.008 
system.time(res2 <- getVals2(x, y))
#    user  system elapsed 
#   1.284   2.919   4.211 
all.equal(res1, res2)
# [1] TRUE

# Test with large vectors
set.seed(144)
x = rnorm(1000000)
y = rnorm(2000000)
system.time(res3 <- getVals(x, y))
#    user  system elapsed 
#   4.402   0.097   4.467 

The reason for the speedup is because this approach is asymptotically faster -- if x is of size n and y is of size m, then the binary search tree approach runs in O((n+m)log(m)) time -- O(m log(m)) to construct the BST and O(n log(m)) to compute p -- while the which.min approach runs in O(nm) time.

like image 66
josliber Avatar answered Dec 29 '22 00:12

josliber


n <- length(x)
p <- rep(NA, n)
for(i in 1:n) {
  id <- which.min(abs(y - x[i]))
  y[id] <- Inf
  p[i] <- id
}
like image 23
danas.zuokas Avatar answered Dec 29 '22 00:12

danas.zuokas