I'm trying to cluster some text documents using <code>scikit-learn</code>. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. <code>bandwidth</code> for MeanShift and <code>eps</code> for DBSCAN) best work for the kind of data I'm using (news articles). I have some testing data which consists of pre-labeled clusters. I have been trying to use <code>scikit-learn</code>'s <code>GridSearchCV</code> but don't understand how (or if it can) be applied in this case, since it needs the test data to be split, but I want to run the evaluation on the entire dataset and compare the results to the pre-labeled data. I have been trying to specify a scoring function which compares the estimator's labels to the true labels, but of course it doesn't work because only a sample of the data has been clustered, not all of it. What's an appropriate approach here?

Have you considered implementing the search yourself? It's not particularly hard to implement a for loop. Even if you want to optimize two parameters it's still fairly easy. For both DBSCAN and MeanShift I do however advise to first understand your similarity measure. It makes more sense to choose the parameters based on an understanding of your measure instead of parameter optimization to match some labels (which has a high risk of overfitting). In other words, at which distance are two articles supposed to be clustered? If this distance varies too much from one data point to another, these algorithms will fail badly; and you may need to find a normalized distance function such that the actual similarity values are meaningful again. TF-IDF is standard on text, but mostly in a retrieval context. They may work much worse in a clustering context. Also beware that MeanShift (similar to k-means) needs to recompute coordinates - on text data, this may yield undesired results; where the updated coordinates actually got worse, instead of better.

How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift and eps for DBSCAN) best work for the kind of data I'm using (news articles).

I have some testing data which consists of pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV but don't understand how (or if it can) be applied in this case, since it needs the test data to be split, but I want to run the evaluation on the entire dataset and compare the results to the pre-labeled data.

I have been trying to specify a scoring function which compares the estimator's labels to the true labels, but of course it doesn't work because only a sample of the data has been clustered, not all of it.

What's an appropriate approach here?

How does density-based clustering work?

The Density-based Clustering tool works by detecting areas where points are concentrated and where they are separated by areas that are empty or sparse. Points that are not part of a cluster are labeled as noise.

What are the two parameters for DBSCAN?

DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region (minPts).

The following function for DBSCAN might help. I've written it to iterate over the hyperparameters eps and min_samples and included optional arguments for min and max clusters. As DBSCAN is unsupervised, I have not included an evaluation parameter.

def dbscan_grid_search(X_data, lst, clst_count, eps_space = 0.5,
                       min_samples_space = 5, min_clust = 0, max_clust = 10):

    """
Performs a hyperparameter grid search for DBSCAN.

Parameters:
    * X_data            = data used to fit the DBSCAN instance
    * lst               = a list to store the results of the grid search
    * clst_count        = a list to store the number of non-whitespace clusters
    * eps_space         = the range values for the eps parameter
    * min_samples_space = the range values for the min_samples parameter
    * min_clust         = the minimum number of clusters required after each search iteration in order for a result to be appended to the lst
    * max_clust         = the maximum number of clusters required after each search iteration in order for a result to be appended to the lst


Example:

# Loading Libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Loading iris dataset
iris = datasets.load_iris()
X = iris.data[:, :] 
y = iris.target

# Scaling X data
dbscan_scaler = StandardScaler()

dbscan_scaler.fit(X)

dbscan_X_scaled = dbscan_scaler.transform(X)

# Setting empty lists in global environment
dbscan_clusters = []
cluster_count   = []


# Inputting function parameters
dbscan_grid_search(X_data = dbscan_X_scaled,
                   lst = dbscan_clusters,
                   clst_count = cluster_count
                   eps_space = pd.np.arange(0.1, 5, 0.1),
                   min_samples_space = pd.np.arange(1, 50, 1),
                   min_clust = 3,
                   max_clust = 6)

"""

    # Importing counter to count the amount of data in each cluster
    from collections import Counter


    # Starting a tally of total iterations
    n_iterations = 0


    # Looping over each combination of hyperparameters
    for eps_val in eps_space:
        for samples_val in min_samples_space:

            dbscan_grid = DBSCAN(eps = eps_val,
                                 min_samples = samples_val)


            # fit_transform
            clusters = dbscan_grid.fit_predict(X = X_data)


            # Counting the amount of data in each cluster
            cluster_count = Counter(clusters)


            # Saving the number of clusters
            n_clusters = sum(abs(pd.np.unique(clusters))) - 1


            # Increasing the iteration tally with each run of the loop
            n_iterations += 1


            # Appending the lst each time n_clusters criteria is reached
            if n_clusters >= min_clust and n_clusters <= max_clust:

                dbscan_clusters.append([eps_val,
                                        samples_val,
                                        n_clusters])


                clst_count.append(cluster_count)

    # Printing grid search summary information
    print(f"""Search Complete. \nYour list is now of length {len(lst)}. """)
    print(f"""Hyperparameter combinations checked: {n_iterations}. \n""")

Have you considered implementing the search yourself?

It's not particularly hard to implement a for loop. Even if you want to optimize two parameters it's still fairly easy.

For both DBSCAN and MeanShift I do however advise to first understand your similarity measure. It makes more sense to choose the parameters based on an understanding of your measure instead of parameter optimization to match some labels (which has a high risk of overfitting).

In other words, at which distance are two articles supposed to be clustered?

If this distance varies too much from one data point to another, these algorithms will fail badly; and you may need to find a normalized distance function such that the actual similarity values are meaningful again. TF-IDF is standard on text, but mostly in a retrieval context. They may work much worse in a clustering context.

Also beware that MeanShift (similar to k-means) needs to recompute coordinates - on text data, this may yield undesired results; where the updated coordinates actually got worse, instead of better.

How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

Tags:

flutter

cluster-analysis

scikit-learn

frnsys

People also ask

2 Answers

Chase

Has QUIT--Anony-Mousse

Recent Activity

Donate For Us

How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

Tags:

flutter

cluster-analysis

scikit-learn

frnsys

People also ask

2 Answers

Chase

Has QUIT--Anony-Mousse

Related questions

Recent Activity

Donate For Us