Grid search for hyperparameter evaluation of clustering in scikit-learn

I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which works fine.

My problem here is that I don't need to use the cross-validation aspect of the GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I can write my own but the ParameterSampler and ParameterGrid objects are very useful.
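
For reference, a rough sketch of what those two helpers do (assuming a recent scikit-learn where they live in sklearn.model_selection; older releases had them in sklearn.grid_search):

from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {"n_clusters": range(2, 11)}

# ParameterGrid enumerates every combination in the grid...
for params in ParameterGrid(param_grid):
    print(params)  # {'n_clusters': 2}, {'n_clusters': 3}, ...

# ...while ParameterSampler draws a fixed number of random combinations.
for params in ParameterSampler(param_grid, n_iter=5, random_state=0):
    print(params)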

My next step will be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking whether there is a simpler way to do this, for example by passing something to the cv parameter?

def silhouette_score(estimator, X):
    clusters = estimator.fit_predict(X)
    score = metrics.silhouette_score(distance_matrix, clusters, metric='precomputed')
    return score

ca = KMeans()
param_grid = {"n_clusters": range(2, 11)}

# run grid search
search = GridSearchCV(
    ca,
    param_grid=param_grid,
    scoring=silhouette_score,
    cv=  # can I pass something here to only use a single fold?
    )
search.fit(distance_matrix)
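
Since cv also accepts an iterable of (train, test) index pairs, one possibility for that last question (just a sketch, not taken from the answers below) is to pass a one-element list containing a single split that covers the full dataset, so only one "fold" is ever evaluated:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

n_samples = distance_matrix.shape[0]  # distance_matrix as above
single_split = [(np.arange(n_samples), np.arange(n_samples))]  # train == test == all rows

search = GridSearchCV(
    KMeans(),
    param_grid={"n_clusters": range(2, 11)},
    scoring=silhouette_score,  # the custom scorer defined above
    cv=single_split,           # a single "fold" over the whole dataset
)
search.fit(distance_matrix)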
asked Jan 05 '16 by Jamie Bull

People also ask

How is the grid search method used in hyperparameter optimization?

Grid search builds a model for every combination of the specified hyperparameters and evaluates each one. A more efficient technique for hyperparameter tuning is randomized search, where random combinations of the hyperparameters are used to find the best solution.
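
As a rough illustration (a supervised sketch on the iris data, with parameter ranges chosen only for the example):

from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try 10 random (C, gamma) combinations instead of an exhaustive grid.
param_distributions = {"C": uniform(0.1, 10), "gamma": uniform(0.001, 1)}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)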

Is grid search used for hyperparameter tuning?

Grid search is the simplest algorithm for hyperparameter tuning. Basically, we divide the domain of the hyperparameters into a discrete grid. Then we try every combination of values in this grid, calculating some performance metric using cross-validation.

What is a GridSearchCV in Sklearn?

GridSearchCV is a technique for searching for the best parameter values in a given grid of parameters. It is essentially a cross-validation method: the model and the parameters are fed in, the best parameter values are extracted, and predictions are then made.
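
For instance, a minimal supervised sketch (dataset and parameter values chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Feed in the model and the grid of candidate parameter values.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)       # the best parameter values are extracted...
predictions = search.predict(X)  # ...and predictions are made with the refitted best model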

What is grid search used for?

Grid search refers to a technique used to identify the optimal hyperparameters for a model. Unlike model parameters, hyperparameters cannot be learned from the training data. Instead, to find the right hyperparameters, we build a model for each combination of hyperparameter values.


2 Answers

The clusteval library will help you to evaluate the data and find the optimal number of clusters. This library contains five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.

pip install clusteval 

The evaluation method can be chosen depending on your data.

# Import library
from clusteval import clusteval

# Set parameters, as an example dbscan
ce = clusteval(method='dbscan')

# Fit to find optimal number of clusters using dbscan
results = ce.fit(X)

# Make plot of the cluster evaluation
ce.plot()

# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)

# results is a dict with various output statistics. One of them is the labels.
cluster_labels = results['labx']
answered Sep 19 '22 by erdogant

Ok, this might be an old question but I use this kind of code:

First, we want to generate all the possible combinations of parameters:

def make_generator(parameters):
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p: parameters[p]
                                 for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res
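
For example, with a tiny two-key grid (values chosen only for illustration) the generator yields every combination:

param_grid = {"n_clusters": [2, 3], "init": ["k-means++", "random"]}
print(list(make_generator(param_grid)))
# -> four parameter dicts, one for each (n_clusters, init) combination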

Then create a loop out of this:

from sklearn.cluster import KMeans

# add fixed parameters - here it's just an arbitrary one
fixed_params = {"max_iter": 300}

param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Evaluate your clustering labels and
    # decide whether to save or discard them!
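
One way to make that save-or-discard decision (a sketch, reusing the silhouette criterion from the question; _data stands in for your feature matrix) is to simply keep the best-scoring parameter set:

from sklearn import metrics
from sklearn.cluster import KMeans

best_score, best_params = float("-inf"), None
for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    labels = ca.fit_predict(_data)
    score = metrics.silhouette_score(_data, labels)  # silhouette lies in [-1, 1]
    if score > best_score:
        best_score, best_params = score, params
print(best_params, best_score)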

Of course, this can be wrapped up in a tidy function, so this solution is mostly an example.

Hope it helps someone!

answered Sep 23 '22 by Alexander B.