
K Means Clustering in R - ignoring row id

Tags:

r

I have a data frame as follows:

X1      X2         X3
3   102.20000   26.07667 
4   115.00000   25.12500
5   36.70000    35.05545

Column X1 is a unique identifier for each row, while X2 and X3 are features.

I want to scale the data before performing k-means clustering on it:

 mydata <- scale(mydata)


  X1               X2            X3
-11715.6     -12.2200734    -9.7826627
-11714.6       0.5799266    -10.7343294
-11713.6      -77.7200734   -0.8038748

I don't want column X1 to be scaled, but I want it to remain in the data frame. Is there any way to do this?

Sarit Adhikari asked Jul 24 '15 09:07



1 Answer

You can attach the unique identifier to the data frame rows via their row names.

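# Store the identifiers as row names, then drop the X1 column
rownames(mydata) = mydata$X1
mydata$X1 = NULL
# scale() returns a matrix; the row names are carried over
mydata = scale(mydata)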

If you then want to perform k-means on the scaled data, I would leave the row names as the identifiers for any analysis. Note that scale() returns a matrix, so to put the identifiers back as a column, first convert it with mydata = as.data.frame(mydata) and then run mydata$X1 = rownames(mydata).
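As an alternative to the row-name approach, here is a minimal sketch that scales only the feature columns in place, so X1 stays in the data frame unscaled. The data are the three rows from the question. The names feature_cols and scaled, and the choices centers = 2, nstart = 25 and set.seed(1), are assumptions made for illustration, not part of the original post.

```r
# Example data copied from the question; X1 is the row identifier.
mydata <- data.frame(
  X1 = c(3, 4, 5),
  X2 = c(102.2, 115.0, 36.7),
  X3 = c(26.07667, 25.12500, 35.05545)
)

# Hypothetical helper name: the columns that should be scaled.
feature_cols <- c("X2", "X3")

# Scale only the feature columns; X1 is left untouched.
scaled <- mydata
scaled[feature_cols] <- scale(mydata[feature_cols])

# Cluster on the scaled features only, so the identifier does not
# influence the distances. centers and nstart are arbitrary here.
set.seed(1)
fit <- kmeans(scaled[feature_cols], centers = 2, nstart = 25)

# Attach the cluster assignment next to the identifier.
scaled$cluster <- fit$cluster
print(scaled)
```

Because the identifier never enters scale() or kmeans(), there is no need to restore it afterwards; each row of scaled carries its X1 value and its cluster label.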

Akhil Nair answered Sep 29 '22 14:09