I need to generate a dataframe containing the minimum Euclidean distance between each row of one dataframe and all rows of another dataframe. Both of my dataframes are large (approx. 40,000 rows). This is what I have worked out so far:
x<-matrix(c(3,6,3,4,8),nrow=5,ncol=7,byrow = TRUE)
y<-matrix(c(1,4,4,1,9),nrow=5,ncol=7,byrow = TRUE)
sed.dist<-numeric(5)
for (i in 1:(length(sed.dist))) {
  sed.dist[i] <- sqrt(sum((y[i, 1:7] - x[i, 1:7])^2))
}
But this only compares row i of y with row i of x (that is, it only works when i = j). What I actually need is to take each row of the "y" dataframe in turn (y[1,1:7], then y[2,1:7], and so on up to i = 5), compute its Euclidean distance to every row of the "x" dataframe (x[j,1:7]), find the minimum of those distances, and store that minimum in another dataframe.
The Euclidean distance is simply the square root of the sum of the squared differences between corresponding elements of the rows (or columns). This is probably the most commonly used distance metric.
Euclidean distance example: determine the Euclidean distance between the two points $(a, b)$ and $(-a, -b)$. The distance is $d = 2\sqrt{a^2 + b^2}$.
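For completeness, the result follows directly from substituting the two points into the two-point distance formula:

$$
d = \sqrt{(-a - a)^2 + (-b - b)^2} = \sqrt{4a^2 + 4b^2} = 2\sqrt{a^2 + b^2}
$$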
In this method, we first initialize two NumPy arrays. Then we take the difference of the two arrays and compute the dot product of that difference with its own transpose, which gives the sum of squared differences. Finally, we take the square root of the result. This is another way to implement Euclidean distance.
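The same dot-product idea carries over to R, the language used in the rest of this thread. The following is only a minimal sketch: the vectors `a` and `b` are made-up illustration values, not data from the question.

```r
# Hypothetical example vectors (assumed for illustration only)
a <- c(3, 6, 3, 4, 8)
b <- c(1, 4, 4, 1, 9)

# Difference between the two vectors
d <- a - b

# crossprod(d) computes t(d) %*% d, i.e. the sum of squared differences;
# drop() turns the resulting 1x1 matrix into a plain number
euclid <- sqrt(drop(crossprod(d)))

# Same quantity written out directly, for comparison
euclid_direct <- sqrt(sum((a - b)^2))
```

Both expressions should agree up to floating-point rounding.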
The formula to calculate the distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$. There are 4 different approaches for finding the Euclidean distance in Python using the NumPy and SciPy libraries.
Try this:
apply(y,1,function(y) min(apply(x,1,function(x,y)dist(rbind(x,y)),y)))
# [1] 5.196152 5.385165 4.898979 4.898979 5.385165
Working from the inside out, we bind a row of x to a row of y and calculate the distance between them using the dist(...) function (written in C). We do this for a given row of y with each row of x in turn, using the inner apply(...), and then find the minimum of the result. Then we do this for each row of y in the outer call to apply(...).
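If the nested apply(...) calls are hard to follow, the same logic can be spelled out with two explicit loops. This is only an illustrative sketch of what the one-liner does, using the x and y matrices from the question; the names `min.dist` and `d` are made up here, and this version is not meant to be fast.

```r
x <- matrix(c(3,6,3,4,8), nrow=5, ncol=7, byrow=TRUE)
y <- matrix(c(1,4,4,1,9), nrow=5, ncol=7, byrow=TRUE)

# One minimum distance per row of y
min.dist <- numeric(nrow(y))

for (i in seq_len(nrow(y))) {
  # Distances from row i of y to every row of x
  d <- numeric(nrow(x))
  for (j in seq_len(nrow(x))) {
    d[j] <- sqrt(sum((y[i, ] - x[j, ])^2))
  }
  # Keep only the smallest of those distances
  min.dist[i] <- min(d)
}

min.dist
```

This should give the same values as the apply(...) version above.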
Expanding on my comment on the question, a pretty fast approach would be the following, although with 40,000 rows you'll have to wait a bit, I guess:
unlist(lapply(seq_len(nrow(y)), function(i) min(sqrt(colSums((y[i, ] - t(x))^2)))))
#[1] 5.196152 5.385165 4.898979 4.898979 5.385165
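For 40,000 rows, another option (not part of the answers here, just a sketch) is to process y in blocks and use the identity |y - x|^2 = |y|^2 + |x|^2 - 2 y.x, so that each block is handled by one matrix product. The function name `min_dist_chunked` and the `chunk` argument are hypothetical; the sketch assumes both matrices have at least one row and the same number of columns, and the block size should be chosen so that a `chunk` by `nrow(x)` matrix fits comfortably in memory.

```r
# Sketch: minimum Euclidean distance from each row of y to any row of x,
# computed block by block to limit memory use.
min_dist_chunked <- function(x, y, chunk = 500L) {
  x2 <- rowSums(x^2)               # squared norm of every row of x
  out <- numeric(nrow(y))
  for (start in seq(1L, nrow(y), by = chunk)) {
    idx <- start:min(start + chunk - 1L, nrow(y))
    yb <- y[idx, , drop = FALSE]   # current block of y rows
    # Squared distances: |y|^2 + |x|^2 - 2 * (y %*% t(x))
    d2 <- outer(rowSums(yb^2), x2, "+") - 2 * tcrossprod(yb, x)
    # Rounding can produce tiny negative values; clamp them to zero
    d2[d2 < 0] <- 0
    out[idx] <- sqrt(apply(d2, 1, min))
  }
  out
}

# Usage with the matrices from the question
x <- matrix(c(3,6,3,4,8), nrow=5, ncol=7, byrow=TRUE)
y <- matrix(c(1,4,4,1,9), nrow=5, ncol=7, byrow=TRUE)
res <- min_dist_chunked(x, y, chunk = 2L)
```

Because this expands the squared difference algebraically, results can differ from the direct computation in the last few digits, so compare with all.equal(...) rather than exact equality.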
And a benchmark comparing the two approaches:
x = matrix(runif(1e2*5), 1e2)
y = matrix(runif(1e2*5), 1e2)
library(microbenchmark)
alex = function() unlist(lapply(seq_len(nrow(y)),
function(i) min(sqrt(colSums((y[i, ] - t(x))^2)))))
jlhoward = function() apply(y,1,function(y)
min(apply(x,1,function(x,y)dist(rbind(x,y)),y)))
all.equal(alex(), jlhoward())
#[1] TRUE
microbenchmark(alex(), jlhoward(), times = 20)
#Unit: milliseconds
#       expr        min         lq     median         uq        max neval
#     alex()   3.369188   3.479011   3.600354   4.513114   4.789592    20
# jlhoward() 422.198621 431.565643 436.561057 442.643181 602.929742    20