I'm using the k-means algorithm from the scikit-learn library, and the values I want to cluster are in a pandas DataFrame with 3 columns: ID, value_1 and value_2.
I want to cluster the information using value_1 and value_2, but I also want to keep the ID associated with each row (so I can create a list of IDs in each cluster).
What's the best way of doing this? Currently it clusters using the ID number as well, and that's not the intention.
My current code (X is the pandas DataFrame):
kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
(X_train, X_test) = train_test_split(X[['value_1','value_2']],test_size=0.30)
kmeans = kmeans.fit(X_train)
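Do the clustering using only the columns of interest (as in your example), then add the labels in kmeans.labels_ as another column of X_train. The labels are in the same order as the rows passed to fit. X_test was not used for fitting, so kmeans.labels_ does not cover it; get its labels from kmeans.predict on the same columns instead.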
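import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# A toy DF
X = pd.DataFrame({'id': [1,2,3,4,5],
                  'value_1': [1,3,1,4,5],
                  'value_2': [0,0,1,5,0]})
# Split ALL columns, so the id stays attached to each row
(X_train, X_test) = train_test_split(X, test_size=0.30)
# Cluster using SOME columns (same KMeans settings as in the question)
kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans = kmeans.fit(X_train[['value_1','value_2']])
# Save the labels; they follow the row order of X_train
X_train.loc[:,'labels'] = kmeans.labels_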
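Since both X_train and X_test are slices of X, you may see a warning here:
A value is trying to be set on a copy of a slice from a DataFrame.
You can ignore it in this case: the new column is still added to X_train, and X itself is left unchanged. If you would rather not see it, calling X_train = X_train.copy() right after the split should avoid it.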
X_train
#    id  value_1  value_2  labels
# 4   5        5        0       0
# 0   1        1        0       0
# 3   4        4        5       1
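To get from the labelled rows to the list of IDs per cluster that the question asks for, one option is to group by the label column. The following is a minimal sketch on the same toy data, not the only way to do it. The names features and ids_per_cluster, the random_state on the split, and the use of assign (which returns a new DataFrame rather than writing into a slice) are choices made for this illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Toy data with the same columns as the example above
X = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value_1': [1, 3, 1, 4, 5],
                  'value_2': [0, 0, 1, 5, 0]})
features = ['value_1', 'value_2']  # illustrative name for the clustering columns

# Split all columns; random_state is added here only to make the split repeatable
X_train, X_test = train_test_split(X, test_size=0.30, random_state=1)

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans.fit(X_train[features])

# assign() returns new DataFrames, so nothing is written into a slice of X
X_train = X_train.assign(labels=kmeans.labels_)
# Rows that were not used for fitting get their labels from predict()
X_test = X_test.assign(labels=kmeans.predict(X_test[features]))

# One list of IDs per cluster label, as a plain dict
ids_per_cluster = X_train.groupby('labels')['id'].apply(list).to_dict()
print(ids_per_cluster)
```

The same groupby call works on X_test, or on the two frames concatenated with pd.concat, if you want the IDs of every row rather than only the training rows.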