Calculating the Euclidean distance between each row of a dataframe and all rows of another dataframe

Tags:

loops

for-loop

r

euclidean-distance

I need to generate a dataframe holding the minimum Euclidean distance between each row of one dataframe and all rows of another dataframe. Both of my dataframes are large (approx. 40,000 rows). This is what I have worked out so far:

x <- matrix(c(3, 6, 3, 4, 8), nrow = 5, ncol = 7, byrow = TRUE)
y <- matrix(c(1, 4, 4, 1, 9), nrow = 5, ncol = 7, byrow = TRUE)

sed.dist <- numeric(5)
for (i in 1:(length(sed.dist))) {
  sed.dist[i] <- sqrt(sum((y[i, 1:7] - x[i, 1:7])^2))
}

But this only works when i = j. What I essentially need is to take each row of the "y" dataframe in turn (y[1,1:7], then y[2,1:7], and so on up to i = 5) and compute its Euclidean distance to every row of the "x" dataframe (x[j,1:7]). For each row i of y, I then need the minimum of those distances, stored in another dataframe.

asked Mar 06 '14 17:03

user14845

2 Answers

Try this:

apply(y,1,function(y) min(apply(x,1,function(x,y)dist(rbind(x,y)),y)))
# [1] 5.196152 5.385165 4.898979 4.898979 5.385165

Working from the inside out, we bind a row of x to a row of y and calculate the distance between them using the dist(...) function (written in C). The inner apply(...) does this for a given row of y against each row of x in turn, and min(...) then takes the smallest of those distances. The outer call to apply(...) repeats this for every row of y.
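
For readers who find the nested apply(...) hard to follow, here is a minimal sketch of the same steps written as explicit loops, using the x and y from the question. The names min.dist and result are illustrative only and are not part of the original answer. It is slower than the apply(...) version, but it shows each step and stores the result in a data frame, as the question asks.

```r
x <- matrix(c(3, 6, 3, 4, 8), nrow = 5, ncol = 7, byrow = TRUE)
y <- matrix(c(1, 4, 4, 1, 9), nrow = 5, ncol = 7, byrow = TRUE)

# min.dist is a hypothetical name: one minimum distance per row of y
min.dist <- numeric(nrow(y))

for (i in seq_len(nrow(y))) {
  # distances from row i of y to every row j of x
  d <- numeric(nrow(x))
  for (j in seq_len(nrow(x))) {
    d[j] <- sqrt(sum((y[i, ] - x[j, ])^2))
  }
  # keep only the smallest distance for this row of y
  min.dist[i] <- min(d)
}

# store the result in a data frame, one row per row of y
result <- data.frame(y.row = seq_len(nrow(y)), min.dist = min.dist)
```

Printing result$min.dist should reproduce the values shown under the one-liner above.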

answered Sep 20 '22 02:09

jlhoward

Expanding on my comment on the question, a pretty fast approach would be the following, although with 40,000 rows you'll have to wait a bit, I guess:

unlist(lapply(seq_len(nrow(y)), function(i) min(sqrt(colSums((y[i, ] - t(x))^2)))))
#[1] 5.196152 5.385165 4.898979 4.898979 5.385165

And a benchmark comparing the two approaches:

x = matrix(runif(1e2*5), 1e2)
y = matrix(runif(1e2*5), 1e2)
library(microbenchmark)
alex = function() unlist(lapply(seq_len(nrow(y)), 
                           function(i) min(sqrt(colSums((y[i, ] - t(x))^2)))))
jlhoward = function() apply(y,1,function(y)
                                  min(apply(x,1,function(x,y)dist(rbind(x,y)),y)))
all.equal(alex(), jlhoward())
#[1] TRUE
microbenchmark(alex(), jlhoward(), times = 20)
#Unit: milliseconds
#       expr        min         lq     median         uq        max neval
#     alex()   3.369188   3.479011   3.600354   4.513114   4.789592    20
# jlhoward() 422.198621 431.565643 436.561057 442.643181 602.929742    20
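
For inputs near the 40,000-row size mentioned in the question, a different technique may be worth trying. It is not the method used in either answer. It is a sketch that processes y in blocks and uses the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, so that each block needs a single matrix product. The function name min_dist_chunked and the chunk.size parameter are hypothetical. The sketch assumes x and y are numeric matrices with the same number of columns and that y has at least one row. Because the identity involves subtraction, results can differ from the direct formula by small floating-point errors.

```r
# Hypothetical helper: minimum Euclidean distance from each row of y to any row of x
min_dist_chunked <- function(x, y, chunk.size = 500) {
  x.sq <- rowSums(x^2)                     # squared norm of every row of x
  out <- numeric(nrow(y))
  for (s in seq(1, nrow(y), by = chunk.size)) {
    idx <- s:min(s + chunk.size - 1, nrow(y))
    yc <- y[idx, , drop = FALSE]           # current block of rows of y
    # squared distances for the block: |y|^2 + |x|^2 - 2 * (y . x)
    d2 <- outer(rowSums(yc^2), x.sq, "+") - 2 * tcrossprod(yc, x)
    d2[d2 < 0] <- 0                        # clamp tiny negatives from rounding
    out[idx] <- sqrt(apply(d2, 1, min))
  }
  out
}

# usage with the matrices from the question
x <- matrix(c(3, 6, 3, 4, 8), nrow = 5, ncol = 7, byrow = TRUE)
y <- matrix(c(1, 4, 4, 1, 9), nrow = 5, ncol = 7, byrow = TRUE)
min_dist_chunked(x, y)
```

Each block allocates matrices of length(idx) by nrow(x) doubles, so chunk.size trades memory for fewer iterations. Whether this beats the colSums version on a given machine would need to be benchmarked.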

answered Sep 24 '22 02:09

alexis_laz
