
How do I weight variables with Gower distance in R

I am new to R and am working on a data set that includes nominal, ordinal and metric variables, so I am using the Gower distance. In the next step I pass this distance to hclust(x, method="complete") to create clusters.

Now I want to know how I can put different weights on the variables in the Gower distance. The documentation says:

daisy(x, metric = c("euclidean", "manhattan", "gower"), stand = FALSE, type = list(), weights = rep.int(1, p))

So there is a way, but I am unsure about the syntax (weights = ...). The documentation of weights and rep.int did not help, and I did not find any other helpful explanation.

I would be very glad if someone could help out.

user3231946 asked Jan 24 '14 14:01



1 Answer

Not sure if this is what you are getting at, but...

Let's say you have 5 variables, e.g. 5 columns in your data frame or matrix. Then weights would be a vector of length=5 containing the weights for the corresponding columns.

The notation weights=rep.int(1,p) in the documentation simply means that the default value of weights is a vector of length p containing all 1's, i.e. every weight is equal to 1. Elsewhere the documentation explains that p is the number of columns.
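As a quick check of the default, here is a minimal sketch. The data frame x is made up purely for illustration. It assumes, as the daisy documentation describes, that the Gower dissimilarity is a weighted average over the columns, so only the relative sizes of the weights should matter.

```r
library(cluster)  # provides daisy()

# Hypothetical mixed-type data: nominal, ordinal and numeric columns
x <- data.frame(
  colour = factor(c("red", "blue", "red", "green")),
  size   = ordered(c("S", "M", "L", "M"), levels = c("S", "M", "L")),
  height = c(1.2, 3.4, 2.2, 5.0)
)
p <- ncol(x)  # number of columns, i.e. the p in rep.int(1, p)

d_default <- daisy(x, metric = "gower")                           # weights left at default
d_ones    <- daisy(x, metric = "gower", weights = rep.int(1, p))  # default written out
d_twos    <- daisy(x, metric = "gower", weights = rep(2, p))      # all weights doubled

# Compare the raw dissimilarity values (attributes stripped)
same_as_default <- isTRUE(all.equal(as.vector(d_default), as.vector(d_ones)))
scale_invariant <- isTRUE(all.equal(as.vector(d_default), as.vector(d_twos)))
```

If both flags come out TRUE, then writing the default explicitly changes nothing, and multiplying all weights by the same constant leaves the distances unchanged.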

Also, note that daisy(...) produces a dissimilarity matrix. This is what you use in hclust(...). So if x is a data frame or matrix with five columns for your variables, then:

d  <- daisy(x, metric="gower", weights=c(1,2,3,4,5))
hc <- hclust(d, method="complete")
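To see what the weights actually do, the sketch below recomputes one weighted Gower dissimilarity by hand and compares it with daisy's result. The data frame and the weights are invented for illustration. It assumes the definition given in the daisy documentation: a nominal column contributes 0 or 1 (match or mismatch), a numeric column contributes the absolute difference divided by that column's range, and the contributions are averaged using the weights.

```r
library(cluster)  # provides daisy()

# Hypothetical data: one nominal and two numeric columns
x <- data.frame(
  group = factor(c("a", "b", "a", "b")),
  len   = c(1, 3, 5, 9),
  wid   = c(10, 20, 40, 30)
)
w <- c(1, 2, 3)  # one weight per column, in column order

d <- as.matrix(daisy(x, metric = "gower", weights = w))

# Per-column contributions to the dissimilarity between rows 1 and 2
part <- c(
  as.numeric(x$group[1] != x$group[2]),              # nominal: 0 if equal, 1 otherwise
  abs(x$len[1] - x$len[2]) / diff(range(x$len)),     # numeric: |difference| / range
  abs(x$wid[1] - x$wid[2]) / diff(range(x$wid))
)

manual <- sum(w * part) / sum(w)  # weighted average of the contributions
agrees <- isTRUE(all.equal(manual, unname(d[1, 2])))
```

A column with a larger weight pulls the overall dissimilarity towards its own contribution, which is why changing the weights can change the clustering.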

EDIT (Response to OP's comments)

The code below shows how the clustering depends on the weights.

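clust.anal <- function(df, w, h) {
  library(cluster)                             # provides daisy()
  d  <- daisy(df, metric="gower", weights=w)   # weighted Gower dissimilarities
  hc <- hclust(d, method="complete")
  clust <- cutree(hc, h=h)                     # cluster membership when cutting at height h
  # label the plot with the weights passed in (w), not a global variable
  plot(hc, sub=paste("weights=", paste(w, collapse=",")))
  rect.hclust(hc, h=h, border="red")           # draw boxes at the same cut height
  invisible(clust)                             # return the memberships without printing
}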

df <- read.table("ExampleClusterData.csv", sep=";", header=TRUE)
df[1] <- factor(df[[1]])
df[2] <- factor(df[[2]])
# weights increase with col number...
wts=c(1,2,3,4,5,6,7)
clust.anal(df,wts,h=0.8)

# weights decrease with col number...
wts=c(7,6,5,4,3,2,1)
clust.anal(df,wts,h=0.8)
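Since ExampleClusterData.csv is not included here, the following sketch runs the same kind of comparison on the flower data set that ships with the cluster package (18 rows, 8 mixed-type columns). The choice of data set, the weights and the cut at k=3 clusters are all assumptions made for illustration, not part of the original example.

```r
library(cluster)  # provides daisy() and the flower data set

data(flower, package = "cluster")
p <- ncol(flower)

# Two opposite weightings: increasing and decreasing with column number
w_up   <- seq_len(p)
w_down <- rev(w_up)

hc_up   <- hclust(daisy(flower, metric = "gower", weights = w_up),   method = "complete")
hc_down <- hclust(daisy(flower, metric = "gower", weights = w_down), method = "complete")

# Cut both trees into the same number of clusters and cross-tabulate
cl_up   <- cutree(hc_up,   k = 3)
cl_down <- cutree(hc_down, k = 3)
cross   <- table(increasing = cl_up, decreasing = cl_down)
print(cross)
```

If the weights had no effect, every row of the cross-table would have a single non-zero cell; counts spread over several cells in a row mean the two weightings group the observations differently.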

jlhoward answered Sep 30 '22 18:09
