
NA in clustering functions (kmeans, pam, clara). How to associate clusters to original data?

I need to cluster some data and I tried kmeans, pam, and clara with R.

The problem is that my data are in a column of a data frame and contain NAs.

I used na.omit() to get my clusters. But then how can I associate them with the original data? The functions return a vector of integers without the NAs and they don't retain any information about the original position.

Is there a clever way to associate the clusters to the original observations in the data frame? (or a way to intelligently perform clustering when NAs are present?)

Thanks

Bakaburg asked Dec 18 '14 11:12


People also ask

How do the PAM clustering results compare to the K-Means results?

The main difference between K-means and PAM is that K-means uses centroids (usually artificial points), while PAM uses medoids, which are always actual points in the dataset.

How are no of clusters determined in K-means?

The optimal number of clusters can be determined as follows: run the clustering algorithm (e.g., k-means) for different values of k, for instance k from 1 to 10. For each k, calculate the total within-cluster sum of squares (wss). Then plot wss against the number of clusters k.
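The procedure above can be sketched in a few lines of R. This is only an illustration on made-up data (the matrix x and the choice of nstart = 10 are assumptions, not part of the original text); the usual reading is to look for the k where the curve bends.

```r
set.seed(123)
# Made-up data: two groups of 2-D points
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

ks <- 1:10
# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# Plot wss against k and look for the bend ("elbow") in the curve
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```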

How does Kmeans assign each new data point?

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
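The two alternating steps can be written out by hand for one-dimensional data. This is a teaching sketch under simple assumptions (two clusters, initial centroids at the minimum and maximum, made-up data); it is not how kmeans() is implemented internally.

```r
set.seed(1)
# Made-up 1-D data drawn around 0 and around 5
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))

# Crude starting centroids: the smallest and the largest value
centers <- c(min(x), max(x))

for (i in 1:20) {
  # (1) assign each point to its nearest centroid
  nearest <- apply(abs(outer(x, centers, "-")), 1, which.min)
  # (2) move each centroid to the mean of the points assigned to it
  new_centers <- as.numeric(tapply(x, nearest, mean))
  if (all(new_centers == centers)) break  # assignments have stopped changing
  centers <- new_centers
}

centers
table(nearest)
```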

What does the PAM function do in R?

The R function pam() [cluster package] can be used to compute the PAM algorithm. The simplified format is pam(x, k), where “x” is the data and k is the number of clusters to be generated. After performing PAM clustering, the R function fviz_cluster() [factoextra package] can be used to visualize the results.
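A minimal sketch of a pam(x, k) call, assuming the cluster package is installed (it ships with most R installations) and using made-up data. It also shows that the medoids are actual rows of the input.

```r
library(cluster)  # provides pam()

set.seed(42)
# Made-up data: two well-separated groups of 2-D points
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

fit <- pam(x, k = 2)

fit$id.med       # row indices of the medoids in x
fit$medoids      # medoid coordinates, i.e. the rows x[fit$id.med, ]
fit$clustering   # cluster label for every row of x
```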


2 Answers

The output of kmeans corresponds to the elements of the object passed as argument x. In your case, you omit the NA elements, and so $cluster indicates the cluster that each element of na.omit(x) belongs to.

Here's a simple example:

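# Example data: 100 values, 10 of them set to NA
d <- data.frame(x=runif(100), cluster=NA)
d$x[sample(100, 10)] <- NA
# Cluster only the non-missing values
clus <- kmeans(na.omit(d$x), 5)

# Write the labels back into the rows that are not NA
d$cluster[which(!is.na(d$x))] <- clus$cluster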

The following call plots the data, with colour indicating the cluster that each point belongs to.

plot(d$x, bg=d$cluster, pch=21)
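As a side note on the original positions: the object returned by na.omit() does record which elements it dropped, in its "na.action" attribute. A sketch of using that attribute instead of is.na() (the variable names kept, dropped and keep_idx are only illustrative):

```r
set.seed(1)
d <- data.frame(x = runif(100))
d$x[sample(100, 10)] <- NA

kept <- na.omit(d$x)
# Positions (in the original vector) of the elements that na.omit() removed;
# this attribute is NULL when nothing was removed
dropped <- attr(kept, "na.action")

# Positions that survived; setdiff() also copes with dropped being NULL
keep_idx <- setdiff(seq_len(nrow(d)), dropped)

d$cluster <- NA_integer_
d$cluster[keep_idx] <- kmeans(kept, 5)$cluster

head(d, 10)
```

Rows whose x is NA keep an NA cluster label, so the cluster column lines up with the original data frame.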


jbaums answered Sep 27 '22 22:09


This code works for me, starting with a matrix containing a whole row of NAs:

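# 10 x 10 matrix with named rows; row 3 is entirely NA
DF <- matrix(rnorm(100), ncol=10)
row.names(DF) <- paste("r", 1:10, sep="")
DF[3,] <- NA
# kmeans keeps the row names of its input, so res is a named vector
res <- kmeans(na.omit(DF), 3)$cluster
res
# Add an empty cluster column, then fill it in by row name
DF <- cbind(DF, 'clus'=NA)
DF[names(res), "clus"] <- res
print(DF[, "clus"])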
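
The same idea carries over to a data frame with several columns and to pam and clara. The sketch below assumes the cluster package is available and uses made-up data. complete.cases() flags the rows without any NA, and the labels are written back only to those rows. Note that pam() and clara() return their labels in $clustering rather than $cluster.

```r
library(cluster)  # provides pam() and clara()

set.seed(1)
# Made-up data frame with NAs scattered over two columns
df <- data.frame(a = rnorm(30), b = rnorm(30))
df$a[c(3, 17)] <- NA
df$b[8] <- NA

vars <- c("a", "b")
ok <- complete.cases(df[, vars])   # TRUE for rows with no missing value

# Start with NA labels everywhere, then fill in the complete rows only
df$km    <- NA_integer_
df$pam   <- NA_integer_
df$clara <- NA_integer_
df$km[ok]    <- kmeans(df[ok, vars], centers = 3)$cluster
df$pam[ok]   <- pam(df[ok, vars], k = 3)$clustering
df$clara[ok] <- clara(df[ok, vars], k = 3)$clustering

df[1:10, ]
```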

agenis answered Sep 27 '22 23:09