ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score

Question

I am trying to calculate silhouette score as I find the optimal number of clusters to create, but get an error that says:

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I am unable to understand the reason for this. Here is the code, that I am using to cluster and calculate silhouette score.

I read the csv that contains the text to be clustered and run K-Means on the n cluster values. What could be the reason I am getting this error?

  #Create cluster using K-Means
#Only creates graph
import matplotlib
#matplotlib.use('Agg')
import re
import os
import nltk, math, codecs
import csv
from nltk.corpus import stopwords
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import silhouette_score

model_name = checkpoint_save_path
loaded_model = Doc2Vec.load(model_name)

#Load the test csv file
data = pd.read_csv(test_filename)
overview = data['overview'].astype('str').tolist()
overview = filter(bool, overview)
vectors = []

def split_words(text):
  return ''.join([x if x.isalnum() or x.isspace() else " " for x in text ]).split()

def preprocess_document(text):
  sp_words = split_words(text)
  return sp_words

for i, t in enumerate(overview):
  vectors.append(loaded_model.infer_vector(preprocess_document(t)))

sse = {}
silhouette = {}


for k in range(1,15):
  km = KMeans(n_clusters=k, max_iter=1000, verbose = 0).fit(vectors)
  sse[k] = km.inertia_
  #FOLLOWING LINE CAUSES ERROR
  silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')

best_cluster_size = 1
min_error = float("inf")

for cluster_size in sse:
    if sse[cluster_size] < min_error:
        min_error = sse[cluster_size]
        best_cluster_size = cluster_size

print(sse)
print("====")
print(silhouette)

seralouk · Accepted Answer

The error is produced because you have a loop for different number of clusters n. During the first iteration, n_clusters is 1 and this leads to all(km.labels_ == 0)to be True.

In other words, you have only one cluster with label 0 (thus, np.unique(km.labels_) prints array([0], dtype=int32)).

`silhouette_score` requires more than 1 cluster labels. This causes the error. The error message is clear.

Example:

from sklearn import datasets
from sklearn.cluster import KMeans
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

km = KMeans(n_clusters=3)
km.fit(X,y)

# check how many unique labels do you have
np.unique(km.labels_)
#array([0, 1, 2], dtype=int32)

We have 3 different clusters/cluster labels.

silhouette_score(X, km.labels_, metric='euclidean')
0.38788915189699597

The function works fine.

Now, let's cause the error:

km2 = KMeans(n_clusters=1)
km2.fit(X,y)

silhouette_score(X, km2.labels_, metric='euclidean')

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

Yuan · Answer

From the documentation,

Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1

So one way to solve this problem is instead of using for k in range(1,15), try to start iteration from k = 2, which is for k in range(2,15). That works for me.

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score

Tags:

python

pandas

machine-learning

k-means

scikit-learn

Suhail Gupta

2 Answers

`silhouette_score` requires more than 1 cluster labels. This causes the error. The error message is clear.

seralouk

Yuan

Recent Activity

Donate For Us

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score

Tags:

python

pandas

machine-learning

k-means

scikit-learn

Suhail Gupta

2 Answers

silhouette_score requires more than 1 cluster labels. This causes the error. The error message is clear.

seralouk

Yuan

Related questions

Recent Activity

Donate For Us

`silhouette_score` requires more than 1 cluster labels. This causes the error. The error message is clear.