I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score, which works fine.
My problem here is that I don't need the cross-validation aspect of GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I can write my own, but the ParameterSampler and ParameterGrid objects are very useful.
My next step will be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking whether there is a simpler way to do this, for example by passing something to the cv parameter?
def silhouette_score(estimator, X):
    clusters = estimator.fit_predict(X)
    score = metrics.silhouette_score(distance_matrix, clusters, metric='precomputed')
    return score

ca = KMeans()
param_dist = {"n_clusters": range(2, 11)}

# run randomized search
search = RandomizedSearchCV(
    ca,
    param_distributions=param_dist,
    n_iter=n_iter_search,
    scoring=silhouette_score,
    cv=  # can I pass something here to only use a single fold?
)
search.fit(distance_matrix)
Grid search builds a model for every combination of the specified hyperparameters and evaluates each one. A more efficient technique for hyperparameter tuning is randomized search, where random combinations of the hyperparameters are used to find the best solution.
Grid search is the simplest algorithm for hyperparameter tuning. Basically, we divide the domain of the hyperparameters into a discrete grid and then try every combination of values on this grid, calculating some performance metric using cross-validation.
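As an aside, the combination generation that grid search and randomized search rely on is available on its own in scikit-learn via ParameterGrid and ParameterSampler, the objects mentioned in the question. A minimal sketch of how they behave, assuming a single n_clusters parameter:

from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Exhaustive grid: yields every combination as a dict.
param_grid = {"n_clusters": list(range(2, 11))}
for params in ParameterGrid(param_grid):
    print(params)  # {'n_clusters': 2}, {'n_clusters': 3}, ...

# Randomized search: yields n_iter combinations drawn from distributions (or lists).
param_dist = {"n_clusters": randint(2, 11)}
for params in ParameterSampler(param_dist, n_iter=5, random_state=0):
    print(params)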
GridSearchCV is a technique for searching for the best parameter values within a given grid of parameters. It is essentially a cross-validation method: the model and the parameter grid are fed in, the best parameter values are extracted, and predictions are then made with them.
Grid search refers to a technique used to identify the optimal hyperparameters for a model. Unlike parameters, hyperparameters cannot be learned from the training data, so to find the right ones we build a model for each combination of hyperparameters.
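Coming back to the question: the cv parameter of GridSearchCV also accepts an iterable of explicit (train, test) index arrays, so one possible workaround, sketched here rather than taken from the original post, is to pass a single split whose train and test indices both cover all samples. The make_blobs data below is only a stand-in for the real 100 records:

import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV

X, _ = make_blobs(n_samples=100, centers=4, random_state=0)  # placeholder data

def cv_silhouette_scorer(estimator, X):
    # GridSearchCV calls a scorer as scorer(estimator, X) when no y is given.
    labels = estimator.fit_predict(X)
    return metrics.silhouette_score(X, labels)

# A single "split" that uses every sample for both fitting and scoring,
# so nothing is actually held out.
single_split = [(np.arange(len(X)), np.arange(len(X)))]

search = GridSearchCV(
    KMeans(n_init=10),
    param_grid={"n_clusters": list(range(2, 11))},
    scoring=cv_silhouette_scorer,
    cv=single_split,
)
search.fit(X)
print(search.best_params_)

With this setup, best_params_ reflects the silhouette score computed on the full sample, which is effectively a grid search without cross-validation.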
The clusteval library will help you to evaluate the data and find the optimal number of clusters. This library contains five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.
pip install clusteval
Depending on your data, the evaluation method can be chosen.
# Import library
from clusteval import clusteval

# Set parameters, as an example dbscan
ce = clusteval(method='dbscan')

# Fit to find optimal number of clusters using dbscan
results = ce.fit(X)

# Make plot of the cluster evaluation
ce.plot()

# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)

# results is a dict with various output statistics. One of them is the labels.
cluster_labels = results['labx']
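If you want to compare this result with the silhouette-based scoring from the question, the returned labels can be fed straight into scikit-learn. A small sketch, assuming X is the same feature matrix and at least two clusters were found:

from sklearn import metrics

# cluster_labels comes from the clusteval results above.
print(metrics.silhouette_score(X, cluster_labels))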
Ok, this might be an old question but I use this kind of code:
First, we want to generate all the possible combinations of parameters:
def make_generator(parameters):
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p: parameters[p]
                                 for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res
Then create a loop out of this:
# add fixed parameters - here it's just a random one
fixed_params = {"max_iter": 300}
param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Estimate your clustering labels and
    # make decision to save or discard it!
Of course, this can be wrapped up in a neat function; the solution here is mostly an example.
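To show what such a function could look like, here is a hypothetical sketch that reuses the make_generator above and picks the best model by silhouette score (my choice of criterion, not the answer's):

from sklearn import metrics
from sklearn.cluster import KMeans

def best_kmeans(data, param_grid, fixed_params):
    # Try every combination from make_generator (defined above) and keep
    # the model with the highest silhouette score.
    best_score, best_model, best_params = -1.0, None, None
    for params in make_generator(param_grid):
        params.update(fixed_params)
        ca = KMeans(**params)
        labels = ca.fit_predict(data)
        score = metrics.silhouette_score(data, labels)
        if score > best_score:
            best_score, best_model, best_params = score, ca, params
    return best_model, best_params, best_score

# Example call, using the same placeholders as the loop above:
# model, params, score = best_kmeans(_data, {"n_clusters": range(2, 11)}, {"max_iter": 300})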
Hope it helps someone!