I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7):
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(random_state=0)
dbscan.fit(X)
However, I found that there was no built-in function (aside from "fit_predict") that could assign the new data points, Y, to the clusters identified in the original data, X. The K-means method has a "predict" function but I want to be able to do the same with DBSCAN. Something like this:
dbscan.predict(X, Y)
So that the density can be inferred from X but the return values (cluster assignments/labels) are only for Y. From what I can tell, this capability is available in R so I assume that it is also somehow available in Python. I just can't seem to find any documentation for this.
Also, I have tried searching for reasons as to why DBSCAN may not be used for labeling new data but I haven't found any justifications.
The main disadvantage of DBSCAN is that it is much more prone to noise, which may lead to spurious clusters. HDBSCAN, on the other hand, focuses on high-density clustering, which reduces this noise problem and allows hierarchical clustering based on a condensed cluster tree.
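If assigning new points is the goal, HDBSCAN also helps here: a minimal sketch, assuming the third-party hdbscan package (and its approximate_predict helper) is available; the random arrays and min_cluster_size=10 are only placeholders:

import numpy as np
import hdbscan

X = np.random.rand(200, 2)   # data the clustering is built from (placeholder)
Y = np.random.rand(10, 2)    # new points to assign (placeholder)

# prediction_data=True tells HDBSCAN to cache what it needs for later assignment
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(X)
labels, strengths = hdbscan.approximate_predict(clusterer, Y)
print(labels)   # -1 marks points treated as noise

This is the closest analogue I know of to the predict() method asked about above.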
The DBSCAN algorithm can be abstracted into the following steps (a rough code sketch of these steps follows below): for each core point that is not already assigned to a cluster, create a new cluster; then recursively find all of its density-connected points and assign them to the same cluster as the core point.
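Here is a rough, brute-force sketch of those two steps, only to illustrate the idea (this is not scikit-learn's implementation; simple_dbscan, region_query and the default eps/min_samples values are made up for illustration):

import numpy as np

def simple_dbscan(X, eps=0.5, min_samples=5):
    n = len(X)
    labels = np.full(n, -1, dtype=int)   # -1 means noise / not yet assigned

    # Brute-force neighbourhood lookup: indices of all points within eps of point i
    def region_query(i):
        return np.where(np.linalg.norm(X - X[i], axis=1) < eps)[0]

    neighborhoods = [region_query(i) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])

    cluster_id = 0
    for i in range(n):
        # Step 1: every core point not yet in a cluster seeds a new cluster
        if not is_core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id
        seeds = list(neighborhoods[i])
        # Step 2: collect all density-connected points of that core point
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                if is_core[j]:
                    # Only core points propagate the cluster further
                    seeds.extend(neighborhoods[j])
        cluster_id += 1

    return labels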
While Anony-Mousse has some good points (clustering is indeed not classification), I think the ability to assign new points has its uses.*
Based on the original DBSCAN paper and robertlayton's ideas on github.com/scikit-learn, I suggest iterating over the core points and assigning the new point to the cluster of the first core point that is within eps of it. Your point is then guaranteed to be at least a border point of the assigned cluster according to the definitions used for the clustering. (Be aware that your point might be deemed noise and not assigned to any cluster.)
I've done a quick implementation:
import numpy as np
from scipy.spatial import distance


def dbscan_predict(dbscan_model, X_new, metric=distance.cosine):
    # Result is noise by default
    y_new = np.ones(shape=len(X_new), dtype=int) * -1

    # Iterate over all input samples to find a label for each
    for j, x_new in enumerate(X_new):
        # Find a core sample closer than eps
        for i, x_core in enumerate(dbscan_model.components_):
            if metric(x_new, x_core) < dbscan_model.eps:
                # Assign the label of x_core to x_new
                y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]
                break

    return y_new
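For example (a minimal usage sketch; the random X and X_new arrays, eps=0.3 and min_samples=5 are only placeholders):

from sklearn.cluster import DBSCAN
from scipy.spatial import distance
import numpy as np

X = np.random.rand(100, 3)      # data the clustering is fitted on (placeholder)
X_new = np.random.rand(10, 3)   # new points to assign (placeholder)

dbscan_model = DBSCAN(eps=0.3, min_samples=5).fit(X)
y_new = dbscan_predict(dbscan_model, X_new, metric=distance.euclidean)
print(y_new)   # -1 marks points that are not within eps of any core sample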
The labels obtained by clustering (dbscan_model = DBSCAN(...).fit(X)) and the labels obtained from the same model on the same data (dbscan_predict(dbscan_model, X)) sometimes differ. I'm not quite certain whether this is a bug somewhere or a result of randomness.
EDIT: I think the above problem of differing prediction outcomes could stem from the possibility that a border point can be close to multiple clusters. Please update if you test this and find an answer. The ambiguity might be resolved by shuffling the core points every time, or by picking the closest core point instead of the first one (a sketch of that variant follows below).
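Here is a sketch of that closest-core-point variant, reusing the fitted model's attributes as above (dbscan_predict_closest is just an illustrative name, and I have not verified that it removes the discrepancy):

import numpy as np
from scipy.spatial import distance


def dbscan_predict_closest(dbscan_model, X_new, metric=distance.euclidean):
    # Noise by default
    y_new = np.full(len(X_new), -1, dtype=int)

    for j, x_new in enumerate(X_new):
        # Distance from the new sample to every core sample
        dists = np.array([metric(x_new, x_core) for x_core in dbscan_model.components_])
        i = dists.argmin()
        # Assign a label only if the nearest core sample is within eps
        if dists[i] < dbscan_model.eps:
            y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]

    return y_new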
*) Case at hand: I'd like to evaluate whether the clusters obtained from one subset of my data make sense for another subset or are simply a special case. If they generalise, it supports the validity of the clusters and of the earlier pre-processing steps applied.