Grid search for hyperparameter evaluation of clustering in scikit-learn

I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which works fine.

My problem here is that I don't need to use the cross-validation aspect of the GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I can write my own but the ParameterSampler and ParameterGrid objects are very useful.
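
For reference, a rough sketch of what those two helpers do (assuming a recent scikit-learn where they live in sklearn.model_selection; older releases had them in sklearn.grid_search):

from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {"n_clusters": range(2, 11)}

# ParameterGrid enumerates every combination in the grid...
for params in ParameterGrid(param_grid):
    print(params)  # {'n_clusters': 2}, {'n_clusters': 3}, ...

# ...while ParameterSampler draws a fixed number of random combinations.
for params in ParameterSampler(param_grid, n_iter=5, random_state=0):
    print(params)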

My next step will be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking whether there is a simpler way to do this, for example by passing something to the cv parameter?

def silhouette_score(estimator, X):
    clusters = estimator.fit_predict(X)
    score = metrics.silhouette_score(distance_matrix, clusters, metric='precomputed')
    return score

ca = KMeans()
param_grid = {"n_clusters": range(2, 11)}

# run grid search
search = GridSearchCV(
    ca,
    param_grid=param_grid,
    scoring=silhouette_score,
    cv=  # can I pass something here to only use a single fold?
    )
search.fit(distance_matrix)
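
Since cv also accepts an iterable of (train, test) index pairs, one possibility for that last question (just a sketch, not taken from the answers below) is to pass a one-element list containing a single split that covers the full dataset, so only one "fold" is ever evaluated:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

n_samples = distance_matrix.shape[0]  # distance_matrix as above
single_split = [(np.arange(n_samples), np.arange(n_samples))]  # train == test == all rows

search = GridSearchCV(
    KMeans(),
    param_grid={"n_clusters": range(2, 11)},
    scoring=silhouette_score,  # the custom scorer defined above
    cv=single_split,           # a single "fold" over the whole dataset
)
search.fit(distance_matrix)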
asked Jan 05 '16 by Jamie Bull

People also ask

How is the grid search method used in hyperparameter optimization?

Grid search builds a model for every combination of the specified hyperparameters and evaluates each one. A more efficient technique for hyperparameter tuning is randomized search, where random combinations of the hyperparameters are used to find the best solution.
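
As a rough illustration (a supervised sketch on the iris data, with parameter ranges chosen only for the example):

from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try 10 random (C, gamma) combinations instead of an exhaustive grid.
param_distributions = {"C": uniform(0.1, 10), "gamma": uniform(0.001, 1)}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)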

Is grid search used for hyperparameter tuning?

Grid search is the simplest algorithm for hyperparameter tuning. Basically, we divide the domain of the hyperparameters into a discrete grid. Then we try every combination of values in this grid, calculating some performance metric using cross-validation.

What is a GridSearchCV in Sklearn?

GridSearchCV is a technique for searching for the best parameter values in a given grid of parameters. It is essentially a cross-validation method: the model and the parameters are fed in, the best parameter values are extracted, and predictions are then made.
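
For instance, a minimal supervised sketch (dataset and parameter values chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Feed in the model and the grid of candidate parameter values.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)       # the best parameter values are extracted...
predictions = search.predict(X)  # ...and predictions are made with the refitted best model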

What is grid search used for?

Grid search refers to a technique used to identify the optimal hyperparameters for a model. Unlike model parameters, hyperparameters cannot be learned from the training data. Instead, to find the right hyperparameters, we build a model for each combination of hyperparameter values.


2 Answers

The clusteval library will help you to evaluate the data and find the optimal number of clusters. This library contains five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.

pip install clusteval 

The evaluation method can be chosen depending on your data.

# Import library
from clusteval import clusteval

# Set parameters, as an example dbscan
ce = clusteval(method='dbscan')

# Fit to find optimal number of clusters using dbscan
results = ce.fit(X)

# Make plot of the cluster evaluation
ce.plot()

# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)

# results is a dict with various output statistics. One of them is the labels.
cluster_labels = results['labx']
answered Sep 19 '22 by erdogant

Ok, this might be an old question but I use this kind of code:

First, we want to generate all the possible combinations of parameters:

def make_generator(parameters):
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p: parameters[p]
                                 for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res
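
For example, with a tiny two-key grid (values chosen only for illustration) the generator yields every combination:

param_grid = {"n_clusters": [2, 3], "init": ["k-means++", "random"]}
print(list(make_generator(param_grid)))
# -> four parameter dicts, one for each (n_clusters, init) combination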

Then create a loop out of this:

from sklearn.cluster import KMeans

# add fixed parameters - here it's just an arbitrary one
fixed_params = {"max_iter": 300}

param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Evaluate your clustering labels and
    # decide whether to save or discard them!
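
One way to make that save-or-discard decision (a sketch, reusing the silhouette criterion from the question; _data stands in for your feature matrix) is to simply keep the best-scoring parameter set:

from sklearn import metrics
from sklearn.cluster import KMeans

best_score, best_params = float("-inf"), None
for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    labels = ca.fit_predict(_data)
    score = metrics.silhouette_score(_data, labels)  # silhouette lies in [-1, 1]
    if score > best_score:
        best_score, best_params = score, params
print(best_params, best_score)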

Of course, this can be wrapped up in a tidy function, so this solution is mostly an example.

Hope it helps someone!

answered Sep 23 '22 by Alexander B.