
Troubleshooting tips for clustering word2vec output with DBSCAN

I'm analyzing a corpus of roughly 2M raw words. I build a model using gensim's word2vec, embed the vectors using sklearn TSNE, and cluster the vectors (from word2vec, not TSNE) using sklearn DBSCAN. The TSNE output looks about right: the layout of the words in 2D space seems to reflect their semantic meaning. There's a group of misspellings, a group of clothing words, and so on.

However, I'm having trouble getting DBSCAN to output meaningful results. It seems to label almost everything in the "0" group (colored teal in the images). As I increase epsilon, the "0" group takes over everything. Here are screenshots with epsilon=10 and epsilon=12.5. With epsilon=20, almost everything is in the same group.

[screenshots: epsilon = 10, epsilon = 12.5]

I would expect, for instance, the group of "clothing" words to all get clustered together (they're unclustered at eps=10). I would also expect more on the order of 100 clusters, as opposed to 5-12 clusters, and to be able to control the size and number of the clusters using epsilon.

A few questions, then. Am I understanding the use of DBSCAN correctly? Is there another clustering algorithm that might be a better choice? How can I tell which clustering algorithm is a good fit for my data?

Is it safe to assume my model is tuned pretty well, given that the TSNE looks about right?

What other techniques can I use in order to isolate the issue with clustering? How do I know if it's my word2vec model, my use of DBSCAN, or something else?
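One thing worth checking (an assumption about the setup, not something from the post): word2vec similarity is conventionally measured with cosine similarity, while DBSCAN defaults to Euclidean distance, and StandardScaler further distorts the angular geometry. A minimal sketch of L2-normalizing the vectors, using random stand-in data, so that Euclidean DBSCAN behaves like cosine-based clustering:

```python
import numpy as np

# Hypothetical stand-in for the word2vec matrix: 5 vectors of dim 100.
vectors = np.random.RandomState(0).randn(5, 100)

# L2-normalize each row. For unit vectors, squared Euclidean distance
# equals 2 * (1 - cosine similarity), so running Euclidean DBSCAN on
# these rows is equivalent to clustering by cosine similarity.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms

a, b = unit[0], unit[1]
euclid_sq = np.sum((a - b) ** 2)   # squared Euclidean distance
cosine = a @ b                     # cosine similarity of unit vectors
```

An alternative would be to pass `metric='cosine'` to DBSCAN directly; normalizing keeps the fast Euclidean index structures usable.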

Here's the code I'm using to perform DBSCAN:

import sys
import gensim
import json
from optparse import OptionParser

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# snip option parsing

model = gensim.models.Word2Vec.load(options.file)
words = sorted(model.vocab.keys())
vectors = StandardScaler().fit_transform([model[w] for w in words])

db = DBSCAN(eps=options.epsilon).fit(vectors)
labels = db.labels_
core_indices = db.core_sample_indices_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated {:d} clusters".format(n_clusters), file=sys.stderr)

output = [
    {'word': w, 'label': int(l), 'isCore': i in core_indices}
    for i, (l, w) in enumerate(zip(labels, words))
]
print(json.dumps(output))
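To isolate whether eps is the problem, one common diagnostic (a sketch, not from the post) is the k-distance curve: sort each point's distance to its k-th nearest neighbor and look for the elbow; points below it sit in dense regions, points above it are candidate noise. The data here is a random stand-in; the real code would pass the `vectors` array:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the word2vec matrix: two dense blobs plus noise.
rng = np.random.RandomState(0)
vectors = np.vstack([
    rng.randn(50, 10) * 0.1,
    rng.randn(50, 10) * 0.1 + 5,
    rng.randn(10, 10) * 3,
])

# Distance from each point to its k-th nearest neighbor
# (counting the point itself, which appears at distance 0).
k = 4
nn = NearestNeighbors(n_neighbors=k).fit(vectors)
dist, _ = nn.kneighbors(vectors)
kth = np.sort(dist[:, -1])

# The "elbow" in this sorted curve is a reasonable eps: values far
# above it indicate points DBSCAN will label as noise.
print(kth[:5], kth[-5:])
```

A common choice is k = min_samples; plotting `kth` makes the elbow easy to spot.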
Ian asked Jan 23 '17 21:01

1 Answer

I'm having the same problem and have been trying these solutions; I'm posting them here in the hope they help you or someone else:

  • Adapt the min_samples value in DBSCAN to your problem; in my case the default value (4) was too high, as some clusters could be formed by as few as 2 words.
  • Starting from a better corpus could also be the solution to your problem; if the model is trained on poor data, it won't perform well.
  • Perhaps DBSCAN is not the best choice; I am also trying K-Means for this problem.

  • Iterating over the model creation also helped me understand which parameters to choose:

    import operator

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Sweep eps and report cluster/noise counts at each value
    for eps in np.arange(0.1, 50, 0.1):
        dbscan_model = DBSCAN(eps=eps, min_samples=3)
        labels = dbscan_model.fit_predict(mat_words)

        # Map each word to its cluster label, sorted by label
        clusters = {w: labels[i] for i, w in enumerate(words_found)}
        dbscan_clusters = sorted(clusters.items(), key=operator.itemgetter(1))

        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = list(labels).count(-1)
        print("EPS:", eps, "\tClusters:", n_clusters, "\tNoise:", n_noise)
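As a sketch of the K-Means alternative mentioned above (the data here is a random stand-in for `mat_words`, not the poster's vectors):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the word-vector matrix: three well-separated blobs.
rng = np.random.RandomState(0)
mat_words = np.vstack([rng.randn(40, 20) + c for c in (0, 5, 10)])

# Unlike DBSCAN, K-Means needs the cluster count up front; on the
# poster's data something on the order of 100 might be worth trying.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(mat_words)
labels = kmeans.labels_
print(len(set(labels)))
```

K-Means assigns every point to a cluster (no noise label), which may suit the "I expect ~100 clusters" goal better than DBSCAN's density criterion.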
    
Nicolò Gasparini answered Nov 03 '22 01:11