I have a question about k-means clustering in Python.
I ran the analysis like this:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=12, random_state=1)
new = data._get_numeric_data().dropna(axis=1)
km.fit(new)
predict = km.predict(new)
How can I add the cluster labels to my original dataframe "data" as an additional column? Thanks!
Step-1: Choose the value of K, the number of clusters to form.
Step-2: Select K random points to act as the initial centroids.
Step-3: Assign each data point to its nearest centroid, based on its distance from each centroid. These assignments form the K clusters.
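The steps above can be sketched in plain NumPy. This is only an illustration under some assumptions: the function name `kmeans_sketch`, its parameters, and the toy data are made up for this example, and the centroid-update loop is the usual continuation of the steps, not something the steps above spell out. It is not how scikit-learn's `KMeans` is implemented internally.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's code)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Assumed continuation: move each centroid to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Step 1: choose K (here 2) for two well-separated toy groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only which points share a label matters.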
Visually, the optimal number of clusters appears to be around 3, but inspecting the data alone does not always give the right answer. Plotting the sum of squared distances against k gives a curve that looks like an elbow. In that plot the elbow is at k=3 (the sum of squared distances falls sharply up to that point and flattens afterwards), suggesting that k=3 is a good choice for this dataset.
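A minimal sketch of how such an elbow curve could be computed, assuming synthetic data with three groups (the data, the range of k, and the variable names are assumptions for illustration). It uses the `inertia_` attribute of a fitted scikit-learn `KMeans`, which holds the sum of squared distances of samples to their closest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated groups of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for several values of k and record the sum of squared distances.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

For data like this, the values should drop steeply until k=3 and change little afterwards; plotting `inertias` against k would show that bend as the elbow.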
Assuming the array of cluster labels has the same length as your dataframe df, all you need to do is this:
df['NEW_COLUMN'] = pd.Series(predict, index=df.index)
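For the setup in the question, an end-to-end version might look like the sketch below. The toy dataframe and the column name `cluster` are assumptions made up for illustration, and `select_dtypes(include="number")` is swapped in as a public alternative to the private `_get_numeric_data()`. Because `dropna(axis=1)` drops columns rather than rows, the labels line up one-to-one with the rows of `data`.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for the question's `data`.
data = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "x": [0.0, 0.1, 0.2, 10.0, 10.1, 10.2],
    "y": [0.0, 0.2, 0.1, 10.0, 10.2, 10.1],
    "z": [1.0, float("nan"), 3.0, 4.0, 5.0, 6.0],
})

# Keep numeric columns only, then drop any column that contains NaN.
# Rows are untouched, so `new` has the same index as `data`.
new = data.select_dtypes(include="number").dropna(axis=1)

km = KMeans(n_clusters=2, n_init=10, random_state=1)
# fit_predict fits the model and returns one label per row of `new`.
data["cluster"] = pd.Series(km.fit_predict(new), index=new.index)
print(data)
```

Passing `index=new.index` also keeps things aligned if rows are dropped instead (`dropna(axis=0)`): rows that were not clustered would simply get NaN in the new column.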