I'm analyzing a corpus of roughly 2M raw words. I build a word2vec model with gensim, project the vectors to 2D with scikit-learn's t-SNE, and cluster the vectors (the word2vec vectors, not the t-SNE output) with scikit-learn's DBSCAN. The t-SNE output looks about right: the layout of the words in 2D space seems to reflect their semantic meaning. There's a group of misspellings, a group of clothing words, and so on.
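For reference, the 2D layout comes from projecting the word2vec vectors with scikit-learn's t-SNE, roughly like this (a minimal sketch; the perplexity and other parameters here are illustrative, not necessarily the ones I used):

from sklearn.manifold import TSNE
import numpy as np

# Stack the word2vec vectors into an (n_words, n_dims) matrix.
word_vectors = np.array([model[w] for w in words])

# Project to 2D for plotting; perplexity and random_state are illustrative.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords = tsne.fit_transform(word_vectors)  # shape (n_words, 2)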
However, I'm having trouble getting DBSCAN to output meaningful results. It labels almost everything as cluster "0" (colored teal in the images), and as I increase epsilon, cluster "0" takes over everything. Here are screenshots with epsilon=10 and epsilon=12.5; with epsilon=20, almost everything ends up in the same cluster.
I would expect, for instance, the group of "clothing" words to be clustered together (they're unclustered at eps=10). I would also expect something on the order of 100 clusters rather than 5-12, and to be able to control the size and number of the clusters with epsilon.
A few questions, then. Am I understanding the use of DBSCAN correctly? Is there another clustering algorithm that might be a better choice? How can I know what a good clustering algorithm for my data is?
Is it safe to assume my model is reasonably well tuned, given that the t-SNE projection looks about right?
What other techniques can I use in order to isolate the issue with clustering? How do I know if it's my word2vec model, my use of DBSCAN, or something else?
Here's the code I'm using to perform DBSCAN:
import sys
import gensim
import json
from optparse import OptionParser
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# snip option parsing
model = gensim.models.Word2Vec.load(options.file)
words = sorted(model.vocab.keys())
vectors = StandardScaler().fit_transform([model[w] for w in words])
db = DBSCAN(eps=options.epsilon).fit(vectors)
labels = db.labels_
core_indices = db.core_sample_indices_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated {:d} clusters".format(n_clusters), file=sys.stderr)
output = [{'word': w, 'label': int(l), 'isCore': i in core_indices} for i, (l, w) in enumerate(zip(labels, words))]
print(json.dumps(output))
DBSCAN identifies clusters by looking at the local density of the data points. It is robust to outliers, and unlike K-Means it does not require the number of clusters to be specified beforehand. It has two key parameters: eps, the distance that defines a point's neighborhood (two points are neighbors if the distance between them is less than or equal to eps), and min_samples (minPts), the minimum number of points required to form a dense region. Points that don't fall in any dense region are labeled as noise (-1).
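As a minimal illustration of those two parameters on toy 2D data (not word vectors; the numbers are made up):

import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [20.0, 20.0]])

# eps: neighborhood radius; min_samples: points needed for a dense region.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1] -- the isolated point is noise (-1)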
I'm having the same problem and am trying these solutions; posting them here in the hope that they help you or someone else.
Tune the min_samples value in DBSCAN to your problem: in my case the default value was too high, since some of my clusters were formed by only 2 words, so I lowered it. Perhaps DBSCAN is not the best choice here; I am also trying K-Means for this problem.
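Here is a minimal K-Means sketch on the same data (assuming mat_words is the matrix of word vectors and words_found the matching word list, as in the loop further down; n_clusters is a guess you have to supply yourself):

from sklearn.cluster import KMeans

# Unlike DBSCAN, K-Means needs the number of clusters up front.
kmeans = KMeans(n_clusters=100, random_state=0, n_init=10)
kmeans_labels = kmeans.fit_predict(mat_words)

# Group words by cluster id for inspection.
kmeans_clusters = {}
for word, label in zip(words_found, kmeans_labels):
    kmeans_clusters.setdefault(label, []).append(word)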
Iterating the DBSCAN fit over a range of eps values also helped me understand which parameters to choose:
import operator
import numpy as np
from sklearn.cluster import DBSCAN

# mat_words: matrix of word vectors; words_found: the corresponding words.
for eps in np.arange(0.1, 50, 0.1):
    dbscan_model = DBSCAN(eps=eps, min_samples=3, metric_params=None, algorithm="auto", leaf_size=30, p=None, n_jobs=1)
    labels = dbscan_model.fit_predict(mat_words)
    clusters = {}
    for i, w in enumerate(words_found):
        clusters[w] = labels[i]
    dbscan_clusters = sorted(clusters.items(), key=operator.itemgetter(1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = len([lab for lab in labels if lab == -1])
    print("EPS: ", eps, "\tClusters: ", n_clusters, "\tNoise: ", n_noise)