I have a dataset with 1000 dimensions and I am trying to cluster the data with DBSCAN in Python. I have a hard time understanding which distance metric to choose and why. Can someone explain this? And how should I decide what value to set eps to?
I am interested in the finer structure of the data, so min_samples is set to 2. Right now I use the default metric of sklearn's DBSCAN (Euclidean), but for small eps values, such as eps < 0.07, I get a few clusters but miss many points, and for larger values I get several smaller clusters and one huge one. I understand that everything depends on the data at hand, but I am interested in tips on how to choose eps values in a coherent and structured way, and which metrics to choose!
I have read this question, but the answers there deal with 10 dimensions and I have 1000 :). I also do not know how to evaluate my metric, so a more elaborate explanation than "evaluate your metric!" would be interesting.
Edit: Or tips on other clustering algorithms that work on high-dimensional data and have an existing Python implementation.
While DBSCAN is great at separating high-density clusters from low-density clusters, it struggles with clusters of similar density, and it struggles with high-dimensional data. This may sound surprising given how often DBSCAN is praised for finding clusters of arbitrary shape, but shape flexibility does not help when density estimates become unreliable in high dimensions.
Graph-based clustering (Spectral, SNN-cliq, Seurat) is perhaps most robust for high-dimensional data as it uses the distance on a graph, e.g. the number of shared neighbors, which is more meaningful in high dimensions compared to the Euclidean distance.
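As a sketch of the graph-based approach (using toy data and made-up parameters, not the asker's actual dataset), sklearn's SpectralClustering can build its affinity from a k-nearest-neighbor graph rather than from raw Euclidean distances:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# toy high-dimensional data: two Gaussian blobs in 1000 dimensions
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 1000)),
    rng.normal(3.0, 1.0, size=(50, 1000)),
])

# affinity="nearest_neighbors" uses graph connectivity instead of raw
# distances, which is the property that makes this more robust in
# high dimensions; n_neighbors=10 is an arbitrary illustrative choice
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
```

Note that n_clusters must be chosen up front here, unlike with DBSCAN.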
First of all, with minPts=2 you aren't actually doing DBSCAN clustering; the result degenerates into single-linkage clustering. You really should use minPts=10 or higher.
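For illustration, a minimal sketch with a synthetic 2D dataset (the cluster centers, eps, and min_samples here are made up for the toy data, not recommendations for the asker's 1000-dimensional set):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# three well-separated Gaussian blobs, 100 points each
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)

# min_samples=10 instead of 2, as recommended above
db = DBSCAN(eps=0.6, min_samples=10).fit(X)

# DBSCAN labels noise points as -1, so exclude that label when counting
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```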
Unfortunately, you didn't bother to tell us what distance metric you actually use!
Epsilon really depends heavily on your data set and metric. We cannot help you there without knowing the parameters and your data set. Have you tried plotting a distance histogram to see which values are typical? That probably is the best heuristic to choose this threshold: look at quantiles of the distance histogram (or a sample thereof).
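The heuristic above can be sketched as follows (again on toy data; k here plays the role of minPts, and the quantiles inspected are arbitrary starting points, not a rule):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)

# distance from every point to its k-th nearest neighbor (k = minPts);
# n_neighbors=k+1 because the query point is returned as its own
# zero-distance neighbor
k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
kdist = np.sort(dists[:, -1])

# inspect quantiles of the k-distance distribution; an eps near the
# "knee" (often somewhere around the 90th-95th percentile) is a
# common starting point
print(np.quantile(kdist, [0.5, 0.9, 0.95]))
```

Plotting the sorted kdist array (the classic k-distance plot) makes the knee visible by eye.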
However, note that OPTICS does get rid of this parameter (at least when you have a proper implementation). When extracting clusters with the Xi method, you only need epsilon large enough to not cut structure you are interested in (and small enough to get the runtime you want - larger is slower, although not linearly). Xi then gives a relative increase in distance that is considered to be significant.
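sklearn also ships an OPTICS implementation with Xi-based extraction; a minimal sketch on the same kind of toy data (the max_eps and xi values are illustrative, not tuned recommendations):

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)

# max_eps only bounds the neighborhood search (larger = slower, per the
# note above); xi is the relative steepness in reachability treated as
# a cluster boundary
opt = OPTICS(min_samples=10, xi=0.05, max_eps=2.0).fit(X)
labels = opt.labels_

# noise points are labeled -1, as in DBSCAN
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```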