
Troubleshooting tips for clustering word2vec output with DBSCAN

I'm analyzing a corpus of roughly 2M raw words. I build a model using gensim's word2vec, embed the vectors using sklearn TSNE, and cluster the vectors (from word2vec, not TSNE) using sklearn DBSCAN. The TSNE output looks about right: the layout of the words in 2D space seems to reflect their semantic meaning. There's a group of misspellings, a group of clothing words, and so on.

However, I'm having trouble getting DBSCAN to output meaningful results. It seems to label almost everything in the "0" group (colored teal in the images). As I increase epsilon, the "0" group takes over everything. Here are screenshots with epsilon=10 and epsilon=12.5. With epsilon=20, almost everything is in the same group.

[screenshots: epsilon = 10, epsilon = 12.5]

I would expect, for instance, the group of "clothing" words to all get clustered together (they're unclustered at eps=10). I would also expect more on the order of 100 clusters, as opposed to 5-12 clusters, and to be able to control the size and number of the clusters using epsilon.

A few questions, then. Am I understanding the use of DBSCAN correctly? Is there another clustering algorithm that might be a better choice? How can I tell which clustering algorithm is a good fit for my data?

Is it safe to assume my model is tuned pretty well, given that the TSNE looks about right?

What other techniques can I use in order to isolate the issue with clustering? How do I know if it's my word2vec model, my use of DBSCAN, or something else?
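One thing worth checking (an assumption about the setup, not something from the post): word2vec similarity is conventionally measured with cosine similarity, while DBSCAN defaults to Euclidean distance, and StandardScaler further distorts the angular geometry. A minimal sketch of L2-normalizing the vectors, using random stand-in data, so that Euclidean DBSCAN behaves like cosine-based clustering:

```python
import numpy as np

# Hypothetical stand-in for the word2vec matrix: 5 vectors of dim 100.
vectors = np.random.RandomState(0).randn(5, 100)

# L2-normalize each row. For unit vectors, squared Euclidean distance
# equals 2 * (1 - cosine similarity), so running Euclidean DBSCAN on
# these rows is equivalent to clustering by cosine similarity.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms

a, b = unit[0], unit[1]
euclid_sq = np.sum((a - b) ** 2)   # squared Euclidean distance
cosine = a @ b                     # cosine similarity of unit vectors
```

An alternative would be to pass `metric='cosine'` to DBSCAN directly; normalizing keeps the fast Euclidean index structures usable.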

Here's the code I'm using to perform DBSCAN:

import sys
import gensim
import json
from optparse import OptionParser

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# snip option parsing

model = gensim.models.Word2Vec.load(options.file)
words = sorted(model.vocab.keys())
vectors = StandardScaler().fit_transform([model[w] for w in words])

db = DBSCAN(eps=options.epsilon).fit(vectors)
labels = db.labels_
core_indices = db.core_sample_indices_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated {:d} clusters".format(n_clusters), file=sys.stderr)

output = [
    {'word': w, 'label': int(l), 'isCore': i in core_indices}
    for i, (l, w) in enumerate(zip(labels, words))
]
print(json.dumps(output))
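To isolate whether eps is the problem, one common diagnostic (a sketch, not from the post) is the k-distance curve: sort each point's distance to its k-th nearest neighbor and look for the elbow; points below it sit in dense regions, points above it are candidate noise. The data here is a random stand-in; the real code would pass the `vectors` array:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the word2vec matrix: two dense blobs plus noise.
rng = np.random.RandomState(0)
vectors = np.vstack([
    rng.randn(50, 10) * 0.1,
    rng.randn(50, 10) * 0.1 + 5,
    rng.randn(10, 10) * 3,
])

# Distance from each point to its k-th nearest neighbor
# (counting the point itself, which appears at distance 0).
k = 4
nn = NearestNeighbors(n_neighbors=k).fit(vectors)
dist, _ = nn.kneighbors(vectors)
kth = np.sort(dist[:, -1])

# The "elbow" in this sorted curve is a reasonable eps: values far
# above it indicate points DBSCAN will label as noise.
print(kth[:5], kth[-5:])
```

A common choice is k = min_samples; plotting `kth` makes the elbow easy to spot.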
Ian asked Jan 23 '17 21:01

1 Answer

I'm having the same problem and have been trying these solutions; I'm posting them here in the hope they help you or someone else:

  • Adapt the min_samples value in DBSCAN to your problem; in my case the default value (4) was too high, as some clusters could be formed by as few as 2 words.
  • Starting from a better corpus could also be the solution to your problem; if the model is trained on poor data, it won't perform well.
  • Perhaps DBSCAN is not the best choice; I am also trying K-Means for this problem.

  • Iterating over the model creation also helped me understand which parameters to choose:

    import operator

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Sweep eps and report cluster/noise counts at each value
    for eps in np.arange(0.1, 50, 0.1):
        dbscan_model = DBSCAN(eps=eps, min_samples=3)
        labels = dbscan_model.fit_predict(mat_words)

        # Map each word to its cluster label, sorted by label
        clusters = {w: labels[i] for i, w in enumerate(words_found)}
        dbscan_clusters = sorted(clusters.items(), key=operator.itemgetter(1))

        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = list(labels).count(-1)
        print("EPS:", eps, "\tClusters:", n_clusters, "\tNoise:", n_noise)
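As a sketch of the K-Means alternative mentioned above (the data here is a random stand-in for `mat_words`, not the poster's vectors):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the word-vector matrix: three well-separated blobs.
rng = np.random.RandomState(0)
mat_words = np.vstack([rng.randn(40, 20) + c for c in (0, 5, 10)])

# Unlike DBSCAN, K-Means needs the cluster count up front; on the
# poster's data something on the order of 100 might be worth trying.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(mat_words)
labels = kmeans.labels_
print(len(set(labels)))
```

K-Means assigns every point to a cluster (no noise label), which may suit the "I expect ~100 clusters" goal better than DBSCAN's density criterion.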
    
Nicolò Gasparini answered Nov 03 '22 01:11