First of all, I am new to R (I started yesterday).
I have two groups of points, data
and centers
, the first one of size n
and the second of size K
(for instance, n = 3823
and K = 10
), and for each i
in the first set, I need to find j
in the second with the minimum distance.
My idea is simple: for each i
, let dist[j]
be the distance between i
and j
, I only need to use which.min(dist)
to find what I am looking for.
Each point is an array of 64
doubles, so
> dim(data)
[1] 3823 64
> dim(centers)
[1] 10 64
I have tried with
for (i in 1:n) {
for (j in 1:K) {
d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
}
S[i] <- which.min(d)
}
which is extremely slow (with n = 200
, it takes more than 40s!!). The fastest solution that I wrote is
distance <- function(point, group) {
return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}
for (i in 1:n) {
d <- distance(data[i,], centers)
which.min(d)
}
Even if it does a lot of computation that I don't use (because dist(m)
computes the distance between all rows of m
), it is way more faster than the other one (can anyone explain why?), but it is not fast enough for what I need, because it will not be used only once. And also, the distance
code is very ugly. I tried to replace it with
distance <- function(point, group) {
return (dist(rbind(point,group))[1:nrow(group)])
}
but this seems to be twice slower. I also tried to use dist
for each pair, but it is also slower.
I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?
ps: I need this to implement k-means by hand (and I need to do it, it is part of an assignment). I believe I will only need Euclidian distance, but I am not yet sure, so I will prefer to have some code where the distance computation can be replaced easily. stats::kmeans
do all computation in less than one second.
Distance between two points is the length of the line segment that connects the two points in a plane. The formula to find the distance between the two points is usually given by d=√((x2 – x1)² + (y2 – y1)²). This formula is used to find the distance between any two points on a coordinate plane or x-y plane.
Formula to calculate this distance is : Euclidean distance = √Σ(xi-yi)^2 where, x and y are the input values. The distance between 2 arrays can also be calculated in R, the array function takes a vector and array dimension as inputs.
The distance from (x0, y0) to this line is measured along a vertical line segment of length |y0 − (− c/b)| = |by0 + c|/|b| in accordance with the formula. Similarly, for vertical lines (b = 0) the distance between the same point and the line is |ax0 + c|/|a|, as measured along a horizontal line segment.
For example, we are given two vectors, vect1 as (1, 4, 3, 5) and vect2 as (2, 3, 2, 4). Their Euclidean distance is given by, √(1 – 2)2 + (4 – 3)2 + (3 – 2)2 + (5 – 4)2 which is equal to 2.
My solution:
# data is a matrix where each row is a point
# point is a vector of values
euc.dist <- function(data, point) {
apply(data, 1, function (row) sqrt(sum((point - row) ^ 2)))
}
You can try it, like:
x <- matrix(rnorm(25), ncol=5)
euc.dist(x, x[1,])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With