I have a data frame as follows:
X1 X2 X3
3 102.20000 26.07667
4 115.00000 25.12500
5 36.70000 35.05545
where column X1 is a unique identifier for each row, while X2 and X3 are features.
I want to scale the data before performing k-means clustering on it:
mydata <- scale(mydata)
X1 X2 X3
-11715.6 -12.2200734 -9.7826627
-11714.6 0.5799266 -10.7343294
-11713.6 -77.7200734 -0.8038748
I don't want column X1 to be scaled, but I want it to remain in the data frame. Is there any way to do this?
k-means has trouble clustering data where clusters are of varying sizes and density. To cluster such data, you need to generalize k-means. Outliers are a further problem: centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored.
The k-means algorithm is often used in clustering applications but its usage requires a complete data matrix. Missing data, however, is common in many applications.
There are two ways to avoid the problem of initialization sensitivity. The first is to repeat k-means: run the algorithm several times, each time with freshly initialized centroids, and keep the clustering that gives the smallest intra-cluster distance and the largest inter-cluster distance.
The kmeans() function has an nstart option that attempts multiple initial configurations and reports on the best one. For example, adding nstart=25 will generate 25 initial configurations. This approach is often recommended.
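As a rough sketch of the nstart option described above, the snippet below runs kmeans() once with a single random start and once with 25. The data in toy is synthetic and made up purely for illustration; it is not from the question.

```r
# Two well-separated synthetic groups; made-up data for illustration only.
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

# One random start versus the best of 25 random starts.
fit_single <- kmeans(toy, centers = 2, nstart = 1)
fit_multi  <- kmeans(toy, centers = 2, nstart = 25)

# Total within-cluster sum of squares; lower means tighter clusters.
fit_single$tot.withinss
fit_multi$tot.withinss
```

On easy data like this both runs will usually agree; the extra starts matter more when the clusters overlap or the number of centers is larger.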
You can tag the unique identifier on to the data frame rows via their rownames.
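# Use the identifier as row names, then drop the column so it is not scaled
rownames(mydata) = mydata$X1
mydata$X1 = NULL
# scale() returns a matrix; the row names carry over
mydata = scale(mydata)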
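If you then want to perform k-means on the scaled data, I would just leave the row names as the identifiers for any analysis. You can put them back whenever you want with mydata = data.frame(X1 = rownames(mydata), mydata). This form is needed because scale() returns a matrix, and assigning with mydata$X1 = ... only works on a data frame.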
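An alternative to the row-name approach, offered here only as a sketch, is to scale the feature columns in place so that X1 never leaves the data frame. The data frame is rebuilt from the three rows shown in the question; feature_cols and the choice of centers = 2 are assumptions made for illustration.

```r
# Toy data copied from the question: X1 is the identifier, X2 and X3 are features.
mydata <- data.frame(
  X1 = c(3, 4, 5),
  X2 = c(102.2, 115.0, 36.7),
  X3 = c(26.07667, 25.12500, 35.05545)
)

# Scale only the feature columns; X1 is untouched and mydata stays a data frame.
feature_cols <- c("X2", "X3")
mydata[feature_cols] <- scale(mydata[feature_cols])

# Cluster on the features only, then store the labels next to the identifier.
fit <- kmeans(mydata[feature_cols], centers = 2, nstart = 25)
mydata$cluster <- fit$cluster
mydata
```

Because the identifier stays in its own column, it keeps its original numeric type instead of being converted to character row names.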