K-means using only specific dataframe columns with scikit-learn

I'm using the k-means algorithm from the scikit-learn library, and the values I want to cluster are in a pandas dataframe with 3 columns: ID, value_1 and value_2.

I want to cluster the information using value_1 and value_2, but I also want to keep the ID associated with it (so I can create a list of IDs in each cluster).

What's the best way of doing this? Currently it clusters using the ID number as well and that's not the intention.

My current code (X is the pandas dataframe):

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
(X_train, X_test) = train_test_split(X[['value_1','value_2']],test_size=0.30)
kmeans = kmeans.fit(X_train)
Jessica Chambers asked Aug 14 '18 21:08



1 Answer

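Split the full DataFrame so the ID stays attached to each row, but fit the clustering using only the columns of interest. Then add kmeans.labels_ as another column to the DataFrame you fitted on (X_train here). The labels are in the same order as the rows passed to fit. Note that kmeans.labels_ covers only the fitted rows; to label X_test, use kmeans.predict on its value columns instead.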

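import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# A toy DF
X = pd.DataFrame({'id': [1,2,3,4,5],
                  'value_1': [1,3,1,4,5],
                  'value_2': [0,0,1,5,0]})

# Split ALL columns so the id stays attached to each row
(X_train, X_test) = train_test_split(X,test_size=0.30)
# Cluster using SOME columns
kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans = kmeans.fit(X_train[['value_1','value_2']])
# Save the labels, one per training row, in row order
X_train.loc[:,'labels'] = kmeans.labels_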

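Since both X_train and X_test are slices of X, you may see a SettingWithCopyWarning here:

A value is trying to be set on a copy of a slice from a DataFrame.

You can ignore it in this case: the new column is only needed on X_train, and X itself is left unchanged.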
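If you would rather not see the warning at all, one option is to take an explicit copy of the training slice before adding the column. This is a minimal sketch reusing the toy data from above. The random_state passed to train_test_split is an assumption added only to make the split reproducible.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value_1': [1, 3, 1, 4, 5],
                  'value_2': [0, 0, 1, 5, 0]})

# random_state is an assumption, used here only for a repeatable split
X_train, X_test = train_test_split(X, test_size=0.30, random_state=1)

# An explicit copy makes X_train an independent DataFrame,
# so adding a column to it is unambiguous
X_train = X_train.copy()

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans.fit(X_train[['value_1', 'value_2']])

# One label per training row, in the same order as the rows passed to fit
X_train['labels'] = kmeans.labels_
```

The original X is not modified, so the labels column exists only on the copy.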

X_train
#   id  value_1  value_2  labels
#4   5        5        0       0
#0   1        1        0       0
#3   4        4        5       1
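To get the list of IDs in each cluster, which is what the question asks for, you could group the labelled frame by the new column. The sketch below also labels the held-out rows with kmeans.predict. The names features and ids_per_cluster and the random_state on the split are assumptions introduced for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value_1': [1, 3, 1, 4, 5],
                  'value_2': [0, 0, 1, 5, 0]})
features = ['value_1', 'value_2']  # columns used for clustering

# random_state is an assumption, used here only for a repeatable split
X_train, X_test = train_test_split(X, test_size=0.30, random_state=1)
X_train = X_train.copy()
X_test = X_test.copy()

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans.fit(X_train[features])
X_train['labels'] = kmeans.labels_

# predict() assigns each held-out row to the nearest fitted centroid
X_test['labels'] = kmeans.predict(X_test[features])

# Map each cluster label to the list of training IDs assigned to it
ids_per_cluster = X_train.groupby('labels')['id'].apply(list).to_dict()
print(ids_per_cluster)
```

Which rows land in which cluster depends on the split, so the printed dictionary will vary with the random_state you choose.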
DYZ answered Oct 01 '22 21:10
