Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculating all distances between one point and a group of points efficiently in R

First of all, I am new to R (I started yesterday).

I have two groups of points, data and centers, the first one of size n and the second of size K (for instance, n = 3823 and K = 10), and for each i in the first set, I need to find j in the second with the minimum distance.

My idea is simple: for each i, let dist[j] be the distance between i and j, I only need to use which.min(dist) to find what I am looking for.

Each point is an array of 64 doubles, so

> dim(data)
[1] 3823   64
> dim(centers)
[1] 10 64

I have tried with

for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
  }
  S[i] <- which.min(d)
}

which is extremely slow (with n = 200, it takes more than 40s!!). The fastest solution that I wrote is

distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}

for (i in 1:n) {
  d <- distance(data[i,], centers)
  which.min(d)
}

Even if it does a lot of computation that I don't use (because dist(m) computes the distance between all rows of m), it is way more faster than the other one (can anyone explain why?), but it is not fast enough for what I need, because it will not be used only once. And also, the distance code is very ugly. I tried to replace it with

distance <- function(point, group) {
  return (dist(rbind(point,group))[1:nrow(group)])
}

but this seems to be twice slower. I also tried to use dist for each pair, but it is also slower.

I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?

ps: I need this to implement k-means by hand (and I need to do it, it is part of an assignment). I believe I will only need Euclidian distance, but I am not yet sure, so I will prefer to have some code where the distance computation can be replaced easily. stats::kmeans do all computation in less than one second.

like image 524
dbarbosa Avatar asked Jun 12 '10 18:06

dbarbosa


People also ask

How do you find the distance between each pair of points?

Distance between two points is the length of the line segment that connects the two points in a plane. The formula to find the distance between the two points is usually given by d=√((x2 – x1)² + (y2 – y1)²). This formula is used to find the distance between any two points on a coordinate plane or x-y plane.

How do you use the distance formula in R?

Formula to calculate this distance is : Euclidean distance = √Σ(xi-yi)^2 where, x and y are the input values. The distance between 2 arrays can also be calculated in R, the array function takes a vector and array dimension as inputs.

How do you find the distance between a point and a line segment?

The distance from (x0, y0) to this line is measured along a vertical line segment of length |y0 − (− c/b)| = |by0 + c|/|b| in accordance with the formula. Similarly, for vertical lines (b = 0) the distance between the same point and the line is |ax0 + c|/|a|, as measured along a horizontal line segment.

How will be the Euclidean distance calculated between 2 coordinates in R?

For example, we are given two vectors, vect1 as (1, 4, 3, 5) and vect2 as (2, 3, 2, 4). Their Euclidean distance is given by, √(1 – 2)2 + (4 – 3)2 + (3 – 2)2 + (5 – 4)2 which is equal to 2.


1 Answers

My solution:

# data is a matrix where each row is a point
# point is a vector of values
euc.dist <- function(data, point) {
  apply(data, 1, function (row) sqrt(sum((point - row) ^ 2)))
}

You can try it, like:

x <- matrix(rnorm(25), ncol=5)
euc.dist(x, x[1,])
like image 144
Adriano Rivolli Avatar answered Oct 23 '22 22:10

Adriano Rivolli