So I have my data in the form of,
X = [[T1],[T2]..] where Tn is the time series of nth user.
I want to cluster these time series with DBSCAN from the scikit-learn library in Python. When I fit the data directly, the output is -1 for every object, across various values of epsilon and min_samples.
What is the correct way to proceed?
Here's my code:
import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=10)
db.fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters, excluding the noise label (-1)
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
Epsilon can be hard to choose by "random search".
It's a distance threshold - you need to know what a typical distance between your time series is. Right now, your epsilon is clearly too small, because everything in your data set is being labeled as noise.
In a map-based application, one might know a good value up front, e.g. a "1 mile radius". But what do distances look like for your time series? You might not even know yet which distance function to use.
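One way to get a feel for your distances is to precompute the full pairwise distance matrix, inspect it, and pass it to DBSCAN with metric='precomputed'. This is a minimal sketch on random data of equal-length series; Euclidean distance is purely a placeholder here (something like dynamic time warping may suit time series better), and the eps choice of half the median distance is an arbitrary starting point, not a recommendation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # hypothetical: 100 series of length 50

# Pairwise distance matrix between all series
D = squareform(pdist(X, metric="euclidean"))

# Inspect typical distances before picking eps
print("median pairwise distance:", np.median(D))

# Cluster on the precomputed matrix; eps chosen relative to the data
db = DBSCAN(eps=np.median(D) / 2, min_samples=5, metric="precomputed").fit(D)
labels = db.labels_
```

The advantage of metric='precomputed' is that you can swap in any distance function you like without it having to be supported by scikit-learn directly.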
In the original DBSCAN paper, the authors proposed a simple method for choosing epsilon, based on a k-distance plot.
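The idea of the k-distance plot: for each point, compute the distance to its k-th nearest neighbor (with k = min_samples), sort these distances, and plot them; the "knee" of the curve is a reasonable eps. A sketch using scikit-learn's NearestNeighbors on stand-in random data (the shapes and k value here are hypothetical):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # hypothetical: 200 series of length 50

k = 10  # match this to your min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the k-th neighbor for each point, sorted ascending
k_distances = np.sort(distances[:, -1])
# Plot k_distances (e.g. with matplotlib) and read eps off the knee,
# where the curve bends sharply upward.
```

Points to the left of the knee are in dense regions; the sharp rise marks the transition to noise, which is why the bend is a sensible eps.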