K-means using only specific dataframe columns with scikit-learn

Question

I'm using the k-means algorithm from the scikit-learn library, and the values I want to cluster are in a pandas dataframe with 3 columns: ID, value_1 and value_2.

I want to cluster the information using value_1 and value_2, but I also want to keep the ID associated with it (so I can create a list of IDs in each cluster).

What's the best way of doing this? Currently it clusters using the ID number as well and that's not the intention.

My current code (X is the pandas dataframe):

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
(X_train, X_test) = train_test_split(X[['value_1','value_2']], test_size=0.30)
kmeans = kmeans.fit(X_train)

DYZ · Accepted Answer

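Do the clustering using only the columns of interest (as in your example), but split the full dataframe so that the ID column stays with each row. Then add kmeans.labels_ as another column to X_train. The labels are in the same order as the rows that were passed to fit. X_test was not used for fitting, so kmeans.labels_ does not cover it; get its labels from kmeans.predict instead.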

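import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# A toy DF
X = pd.DataFrame({'id': [1,2,3,4,5],
                  'value_1': [1,3,1,4,5],
                  'value_2': [0,0,1,5,0]})

# Split ALL columns, so the id stays attached to each row
(X_train, X_test) = train_test_split(X, test_size=0.30)
# Cluster using SOME columns
kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans = kmeans.fit(X_train[['value_1','value_2']])
# Save the labels (same order as the rows of X_train)
X_train.loc[:,'labels'] = kmeans.labels_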

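Since both X_train and X_test are slices of X, you may see a warning here, depending on your pandas version:

A value is trying to be set on a copy of a slice from a DataFrame.

This is pandas' SettingWithCopyWarning. You can ignore it here: the labels are still written to X_train, and X itself is not modified. To avoid the warning, make X_train an explicit copy (X_train = X_train.copy()) before adding the column.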

X_train
#   id  value_1  value_2  labels
#4   5        5        0       0
#0   1        1        0       0
#3   4        4        5       1
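
To get from the labelled dataframe to the list of IDs in each cluster that the question asks for, one option is a groupby on the new column. The following is a minimal sketch rather than part of the original answer. The names features and ids_per_cluster, the fixed random_state in the split, and the explicit .copy() calls are assumptions added for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Toy data, same shape as in the answer above
X = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value_1': [1, 3, 1, 4, 5],
                  'value_2': [0, 0, 1, 5, 0]})
features = ['value_1', 'value_2']  # hypothetical helper name

# Split all columns; random_state is fixed here only for repeatability
X_train, X_test = train_test_split(X, test_size=0.30, random_state=1)
# Explicit copies, so adding a column does not trigger the slice warning
X_train, X_test = X_train.copy(), X_test.copy()

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans.fit(X_train[features])

# labels_ covers the training rows; predict() labels rows not seen in fit
X_train['labels'] = kmeans.labels_
X_test['labels'] = kmeans.predict(X_test[features])

# One list of ids per cluster label, as a plain dict
ids_per_cluster = X_train.groupby('labels')['id'].apply(list).to_dict()
print(ids_per_cluster)
```

Which IDs end up in which cluster depends on the random split and on the k-means initialisation, so the printed dictionary will vary with those settings.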

Tags: python, pandas, k-means, scikit-learn

Jessica Chambers
