I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7):
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(random_state=0)
dbscan.fit(X)
However, I found that there was no built-in function (aside from "fit_predict") that could assign the new data points, Y, to the clusters identified in the original data, X. The K-means method has a "predict" function but I want to be able to do the same with DBSCAN. Something like this:
dbscan.predict(X, Y)
So that the density can be inferred from X but the return values (cluster assignments/labels) are only for Y. From what I can tell, this capability is available in R so I assume that it is also somehow available in Python. I just can't seem to find any documentation for this.
Also, I have tried searching for reasons as to why DBSCAN may not be used for labeling new data but I haven't found any justifications.
The main disadvantage of DBSCAN is that it is much more prone to noise, which may lead to spurious clusters. HDBSCAN, on the other hand, focuses on high-density clustering, which reduces this noise problem and allows hierarchical clustering based on a condensed cluster tree.
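If assigning new points is the goal, HDBSCAN also helps here: a minimal sketch, assuming the third-party hdbscan package (and its approximate_predict helper) is available; the random arrays and min_cluster_size=10 are only placeholders:

import numpy as np
import hdbscan

X = np.random.rand(200, 2)   # data the clustering is built from (placeholder)
Y = np.random.rand(10, 2)    # new points to assign (placeholder)

# prediction_data=True tells HDBSCAN to cache what it needs for later assignment
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(X)
labels, strengths = hdbscan.approximate_predict(clusterer, Y)
print(labels)   # -1 marks points treated as noise

This is the closest analogue I know of to the predict() method asked about above.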
The DBSCAN algorithm can be abstracted into the following steps (a rough code sketch of these steps follows below): for each core point that is not already assigned to a cluster, create a new cluster; then recursively find all of its density-connected points and assign them to the same cluster as the core point.
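Here is a rough, brute-force sketch of those two steps, only to illustrate the idea (this is not scikit-learn's implementation; simple_dbscan, region_query and the default eps/min_samples values are made up for illustration):

import numpy as np

def simple_dbscan(X, eps=0.5, min_samples=5):
    n = len(X)
    labels = np.full(n, -1, dtype=int)   # -1 means noise / not yet assigned

    # Brute-force neighbourhood lookup: indices of all points within eps of point i
    def region_query(i):
        return np.where(np.linalg.norm(X - X[i], axis=1) < eps)[0]

    neighborhoods = [region_query(i) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])

    cluster_id = 0
    for i in range(n):
        # Step 1: every core point not yet in a cluster seeds a new cluster
        if not is_core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id
        seeds = list(neighborhoods[i])
        # Step 2: collect all density-connected points of that core point
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                if is_core[j]:
                    # Only core points propagate the cluster further
                    seeds.extend(neighborhoods[j])
        cluster_id += 1

    return labels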
While Anony-Mousse has some good points (clustering is indeed not classification), I think the ability to assign new points has its uses.*
Based on the original DBSCAN paper and robertlayton's ideas on github.com/scikit-learn, I suggest iterating over the core points and assigning the new point to the cluster of the first core point that is within eps of it. Your point is then guaranteed to be at least a border point of the assigned cluster according to the definitions used for the clustering. (Be aware that your point might be deemed noise and not assigned to any cluster.)
I've done a quick implementation:
import numpy as np
from scipy.spatial import distance


def dbscan_predict(dbscan_model, X_new, metric=distance.cosine):
    # Result is noise by default
    y_new = np.ones(shape=len(X_new), dtype=int) * -1

    # Iterate over all input samples to find a label for each
    for j, x_new in enumerate(X_new):
        # Find a core sample closer than eps
        for i, x_core in enumerate(dbscan_model.components_):
            if metric(x_new, x_core) < dbscan_model.eps:
                # Assign the label of x_core to x_new
                y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]
                break

    return y_new
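For example (a minimal usage sketch; the random X and X_new arrays, eps=0.3 and min_samples=5 are only placeholders):

from sklearn.cluster import DBSCAN
from scipy.spatial import distance
import numpy as np

X = np.random.rand(100, 3)      # data the clustering is fitted on (placeholder)
X_new = np.random.rand(10, 3)   # new points to assign (placeholder)

dbscan_model = DBSCAN(eps=0.3, min_samples=5).fit(X)
y_new = dbscan_predict(dbscan_model, X_new, metric=distance.euclidean)
print(y_new)   # -1 marks points that are not within eps of any core sample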
The labels obtained by clustering (dbscan_model = DBSCAN(...).fit(X)) and the labels obtained from the same model on the same data (dbscan_predict(dbscan_model, X)) sometimes differ. I'm not quite certain whether this is a bug somewhere or a result of randomness.
EDIT: I think the above problem of differing prediction outcomes could stem from the possibility that a border point can be close to multiple clusters. Please update if you test this and find an answer. The ambiguity might be resolved by shuffling the core points every time, or by picking the closest core point instead of the first one (a sketch of that variant follows below).
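Here is a sketch of that closest-core-point variant, reusing the fitted model's attributes as above (dbscan_predict_closest is just an illustrative name, and I have not verified that it removes the discrepancy):

import numpy as np
from scipy.spatial import distance


def dbscan_predict_closest(dbscan_model, X_new, metric=distance.euclidean):
    # Noise by default
    y_new = np.full(len(X_new), -1, dtype=int)

    for j, x_new in enumerate(X_new):
        # Distance from the new sample to every core sample
        dists = np.array([metric(x_new, x_core) for x_core in dbscan_model.components_])
        i = dists.argmin()
        # Assign a label only if the nearest core sample is within eps
        if dists[i] < dbscan_model.eps:
            y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]

    return y_new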
*) Case at hand: I'd like to evaluate whether the clusters obtained from one subset of my data make sense for another subset or are simply a special case. If they generalise, it supports the validity of the clusters and of the earlier pre-processing steps applied.