I'd like to create a distance matrix of weighted Euclidean distances from a data frame. The weights are defined in a vector. Here's an example:
library("cluster")
a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df <- data.frame(a,b,c)
weighting <- c(1, 2, 3)
dm <- as.matrix(daisy(df, metric = "euclidean", weights = weighting))
I've searched everywhere and can't find a package or solution for this in R. The 'daisy' function in the 'cluster' package claims to support weighting, but the weights don't seem to be applied, and it just returns regular Euclidean distances.
Any ideas, Stack Overflow?
Euclidean distance is the shortest possible (straight-line) distance between two points. The formula is: Euclidean distance = sqrt(Σ_i (x_i - y_i)^2), where x and y are the input vectors. The distance between two arrays can also be calculated in R; the array function takes a vector and the array dimensions as inputs.
The distance-weighted mean is: DWM = (w1*x1 + w2*x2 + w3*x3 + w4*x4) / (w1 + w2 + w3 + w4) ≈ 7.3.
Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.
The Manhattan distance is defined as the sum of the absolute differences between coordinates in corresponding dimensions. For example, in a 2-dimensional space with two points Point1 (x1, y1) and Point2 (x2, y2), the Manhattan distance is |x1 - x2| + |y1 - y2|.
We can use @WalterTross' technique of scaling by multiplying each column by the square root of its respective weight first:
newdf <- sweep(df, 2, weighting, function(x,y) x * sqrt(y))
as.matrix(daisy(newdf, metric="euclidean"))
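For reference, the same scaling idea can be sketched in base R without the cluster package. This assumes the target is the usual weighted Euclidean distance, sqrt(sum(w * (x - y)^2)); the helper name weighted_dist is made up for this illustration.

```r
# Example data from the question
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(5, 4, 3, 2, 1),
                 c = c(5, 4, 1, 2, 3))
weighting <- c(1, 2, 3)

# Hypothetical helper (not from any package): scale each column by the
# square root of its weight, then take ordinary Euclidean distances.
# Assumes the definition sqrt(sum(w * (x - y)^2)).
weighted_dist <- function(d, w) {
  scaled <- sweep(as.matrix(d), 2, sqrt(w), "*")
  as.matrix(dist(scaled, method = "euclidean"))
}

dm <- weighted_dist(df, weighting)
```

With these weights, the distance between rows 1 and 2 works out to sqrt(1*1 + 2*1 + 3*1) = sqrt(6), since every column differs by 1 between those rows.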
But in case you would like more control over, and a better understanding of, what Euclidean distance is, we can write a custom function. Note that I have chosen a different weighting method:
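# All (i, j) pairs of row indices
xpand <- function(d) do.call("expand.grid", rep(list(1:nrow(d)), 2))
# Euclidean norm of a vector
euc_norm <- function(x) sqrt(sum(x^2))
# Pairwise row distances; each difference is multiplied by its weight before squaring
euc_dist <- function(mat, weights=1) {
  iter <- xpand(mat)
  vec <- mapply(function(i,j) euc_norm(weights*(mat[i,] - mat[j,])),
                iter[,1], iter[,2])
  matrix(vec,nrow(mat), nrow(mat))
}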
We can test the result by checking against the daisy function:
#test1
as.matrix(daisy(df, metric="euclidean"))
#          1        2        3        4        5
# 1 0.000000 1.732051 4.898979 5.196152 6.000000
# 2 1.732051 0.000000 3.316625 3.464102 4.358899
# 3 4.898979 3.316625 0.000000 1.732051 3.464102
# 4 5.196152 3.464102 1.732051 0.000000 1.732051
# 5 6.000000 4.358899 3.464102 1.732051 0.000000
euc_dist(df)
#          [,1]     [,2]     [,3]     [,4]     [,5]
# [1,] 0.000000 1.732051 4.898979 5.196152 6.000000
# [2,] 1.732051 0.000000 3.316625 3.464102 4.358899
# [3,] 4.898979 3.316625 0.000000 1.732051 3.464102
# [4,] 5.196152 3.464102 1.732051 0.000000 1.732051
# [5,] 6.000000 4.358899 3.464102 1.732051 0.000000
I doubt Walter's method for two reasons. Firstly, I've never seen weights applied by their square root; it's usually 1/w. Secondly, when I apply your weights to my function, I get a different result.
euc_dist(df, weights=weighting)
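One way to see why the two results differ, offered as a sketch rather than a verdict on which weighting is right: euc_dist multiplies each difference by its weight before squaring, so it computes sqrt(sum((w * (x - y))^2)), whereas scaling the columns by sqrt(w) computes sqrt(sum(w * (x - y)^2)). Under that reading, passing sqrt(weighting) to euc_dist should reproduce the scaled result. The check below repeats the functions from this answer so that it runs on its own, and it uses base dist in place of daisy, which is an assumption made to keep the example free of extra packages.

```r
# Example data from the question
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(5, 4, 3, 2, 1),
                 c = c(5, 4, 1, 2, 3))
weighting <- c(1, 2, 3)

# Functions from the answer above
xpand <- function(d) do.call("expand.grid", rep(list(1:nrow(d)), 2))
euc_norm <- function(x) sqrt(sum(x^2))
euc_dist <- function(mat, weights = 1) {
  iter <- xpand(mat)
  vec <- mapply(function(i, j) euc_norm(weights * (mat[i, ] - mat[j, ])),
                iter[, 1], iter[, 2])
  matrix(vec, nrow(mat), nrow(mat))
}

# Column scaling by sqrt(weight), then ordinary Euclidean distance
scaled <- as.matrix(dist(sweep(as.matrix(df), 2, sqrt(weighting), "*")))

# Weights enter squared in euc_dist, so sqrt(weighting) should match the scaling
same <- isTRUE(all.equal(euc_dist(df, weights = sqrt(weighting)), scaled,
                         check.attributes = FALSE))

# The raw weights, by contrast, should give a different matrix
differs <- !isTRUE(all.equal(euc_dist(df, weights = weighting), scaled,
                             check.attributes = FALSE))
```

If both flags come out TRUE, the two approaches agree once the weights are put on the same footing, and the remaining question is only which definition of "weight" is intended.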