Silhouette coefficient in Python with sklearn

Tags:

python

cluster-analysis

scikit-learn

I'm having trouble computing the silhouette coefficient in Python with sklearn. Here is my code:

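import pandas as pd
from sklearn import datasets
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()
col = iris.feature_names  # assumption: `col` was not defined in the original post
X = pd.DataFrame(iris.data, columns=col)
y = pd.DataFrame(iris.target, columns=['cluster'])
# Passing DataFrames here is what triggers the error below
s = silhouette_score(X, y, metric='euclidean', sample_size=50)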

I get the error:

IndexError: indices are out-of-bounds

I want to use the sample_size parameter because the silhouette score takes too long to compute on very large datasets. Does anyone know how this parameter is supposed to work?

Complete traceback:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-72-70ff40842503> in <module>()
      4 X = pd.DataFrame(iris.data, columns = col)
      5 y = pd.DataFrame(iris.target,columns = ['cluster'])
----> 6 s = silhouette_score(X, y, metric='euclidean',sample_size=50)

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
     81             X, labels = X[indices].T[indices].T, labels[indices]
     82         else:
---> 83             X, labels = X[indices], labels[indices]
     84     return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
     85 

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1993         if isinstance(key, (np.ndarray, list)):
   1994             # either boolean or fancy integer index
-> 1995             return self._getitem_array(key)
   1996         elif isinstance(key, DataFrame):
   1997             return self._getitem_frame(key)

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_array(self, key)
   2030         else:
   2031             indexer = self.ix._convert_to_indexer(key, axis=1)
-> 2032             return self.take(indexer, axis=1, convert=True)
   2033 
   2034     def _getitem_multilevel(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in take(self, indices, axis, convert)
   2981         if convert:
   2982             axis = self._get_axis_number(axis)
-> 2983             indices = _maybe_convert_indices(indices, len(self._get_axis(axis)))
   2984 
   2985         if self._is_mixed_type:

/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _maybe_convert_indices(indices, n)
   1038     mask = (indices>=n) | (indices<0)
   1039     if mask.any():
-> 1040         raise IndexError("indices are out-of-bounds")
   1041     return indices
   1042 

IndexError: indices are out-of-bounds

asked Dec 04 '13 11:12

Scratch

1 Answer

silhouette_score expects regular NumPy arrays as input. Why wrap your arrays in DataFrames?

>>> silhouette_score(iris.data, iris.target, sample_size=50)
0.52999903616584543
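
If the data is already held in DataFrames, one way to follow this advice is to pass the underlying arrays instead. This is a minimal sketch, assuming the column names come from iris.feature_names (the original post does not show how col was defined). The random_state argument is added here only so that the sampled subset, and therefore the score, is repeatable:

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()
# Assumption: feature names stand in for the undefined `col` from the question.
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=["cluster"])

# Hand scikit-learn plain arrays: a 2-D feature matrix and a 1-D label vector.
X_arr = X.values
labels = y["cluster"].values

# sample_size=50 scores a random subset of 50 rows instead of all 150.
s = silhouette_score(X_arr, labels, metric="euclidean",
                     sample_size=50, random_state=0)
print(s)
```

The value changes with the sampled subset, so a different random_state will generally give a slightly different score.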

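The traceback shows that silhouette_score subsamples with fancy indexing on the first axis (X[indices]), which selects rows of a NumPy array. Indexing a DataFrame with [] selects columns rather than rows, so the sampled row indices (drawn from the 150 samples) are looked up among the 4 columns, which is why you get "indices are out-of-bounds".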
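
The difference between the two indexing behaviours can be seen on a small toy example. This is an illustrative sketch only: the array, the column names "a" and "b", and the index values are made up, and the exact exception type raised by the DataFrame lookup depends on the pandas version (IndexError in the traceback above, KeyError in more recent releases):

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(6, 2)           # toy data: 6 rows, 2 columns
df = pd.DataFrame(arr, columns=["a", "b"])
indices = np.array([0, 3, 5])               # row positions to subsample

rows_from_array = arr[indices]              # NumPy: selects rows 0, 3 and 5
rows_from_frame = df.iloc[indices].values   # pandas: .iloc selects rows by position

try:
    df[indices]                             # pandas: [] treats the values as column selectors
    frame_lookup_failed = False
except (KeyError, IndexError):
    frame_lookup_failed = True

print(rows_from_array)
print(frame_lookup_failed)
```

Row selection on a DataFrame has to go through .iloc (or the underlying array), which is why code written for NumPy-style X[indices] breaks when handed a DataFrame.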

answered Oct 12 '22 15:10

ogrisel
