Cluster high-dimensional data with Python and DBSCAN

Q: Does DBSCAN work well with high-dimensional data?

DBSCAN is good at separating high-density clusters from low-density regions, but it struggles when clusters have varying densities, because a single eps and minPts setting cannot fit all of them. It also struggles with high-dimensional data, even though it can find clusters of arbitrary shape.

Q: Which clustering algorithm is best for high-dimensional data?

Graph-based clustering (Spectral, SNN-cliq, Seurat) is perhaps most robust for high-dimensional data as it uses the distance on a graph, e.g. the number of shared neighbors, which is more meaningful in high dimensions compared to the Euclidean distance.

Q: How many dimensions can DBSCAN handle?

DBSCAN itself places no limit on the number of dimensions: it only needs a distance function, so three dimensions work as well as two.

Tags:

python

cluster-analysis

data-mining

dbscan

n-dimensional

I have a dataset with 1000 dimensions and I am trying to cluster the data with DBSCAN in Python. I have a hard time understanding what metric to choose and why.

Can someone explain this? And how should I decide what values to set eps to?

I am interested in the finer structure of the data, so min_samples is set to 2. I currently use the default metric for DBSCAN in sklearn. For small eps values, such as eps < 0.07, I get a few clusters but miss many points, and for larger values I get several smaller clusters and one huge one. I understand that everything depends on the data at hand, but I am interested in tips on how to choose eps in a coherent and structured way, and on which metric to choose!

I have read this question, but the answers there concern 10 dimensions and I have 1000 :) I also do not know how to evaluate my metric, so a more elaborate explanation than "evaluate your metric!" would be interesting.

Edit: Or tips on other clustering algorithms that work on high dimensional data with an existing python implementation.


asked Apr 22 '13 14:04

Ekgren

1 Answer

First of all, with minPts=2 you aren't actually doing DBSCAN clustering; the result degenerates into single-linkage clustering.

You really should use minPts=10 or higher.
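The single-linkage claim can be checked on a small example. The sketch below is only an illustration under stated assumptions: the data is synthetic, eps=1.2 is an arbitrary value chosen for the demo, and scipy's single-linkage routines stand in for "single-linkage clustering". With min_samples=2 (sklearn's name for minPts), every point that has at least one neighbour within eps is a core point, so the clusters should be exactly the single-linkage groups obtained by cutting the dendrogram at height eps, with isolated points reported as noise.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # synthetic stand-in data (assumption)
eps = 1.2                      # arbitrary threshold for this demo (assumption)

# DBSCAN with min_samples=2: one neighbour within eps is enough to be a core point.
db_labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)

# Single-linkage dendrogram cut at height eps.
sl_labels = fcluster(linkage(X, method="single"), t=eps, criterion="distance")

# Compare the two partitions on the points DBSCAN did not mark as noise.
clustered = db_labels != -1
ari = adjusted_rand_score(db_labels[clustered], sl_labels[clustered])

# DBSCAN noise points should be singleton groups in the single-linkage cut.
sizes = np.bincount(sl_labels)
noise_are_singletons = bool(np.all(sizes[sl_labels[~clustered]] == 1))

print("adjusted Rand index on clustered points:", ari)
print("noise points are single-linkage singletons:", noise_are_singletons)
```

An adjusted Rand index of 1 on the clustered points means the two partitions agree; raising min_samples is what makes DBSCAN behave differently from single linkage.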

Unfortunately, you didn't bother to tell us what distance metric you actually use!

Epsilon really depends heavily on your data set and metric. We cannot help you there without knowing the parameters and your data set. Have you tried plotting a distance histogram to see which values are typical? That probably is the best heuristic to choose this threshold: look at quantiles of the distance histogram (or a sample thereof).
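A minimal sketch of the distance-histogram heuristic, assuming Euclidean distance and using random data as a placeholder for the real data set (both are assumptions, as are the sample size and the quantile levels). It computes all pairwise distances on a random sample and prints a few low quantiles, which are the natural candidates for eps:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1000))  # placeholder for the real 1000-dimensional data

# Work on a random sample so the number of pairs stays manageable.
sample = X[rng.choice(len(X), size=500, replace=False)]

# Condensed vector holding every pairwise distance within the sample.
dists = pdist(sample, metric="euclidean")

# Low quantiles show which distances count as "close" for this data and metric.
probs = [0.001, 0.01, 0.05, 0.25, 0.5]
quantiles = np.quantile(dists, probs)
for p, q in zip(probs, quantiles):
    print(f"{p:>6.1%} of sampled pairs are closer than {q:.3f}")
```

Swapping the `metric` argument (for example to "cosine") and comparing the resulting histograms is one way to see how much contrast each metric offers. If the low quantiles and the median are almost identical, the metric barely distinguishes near from far points, and no eps value is likely to work well.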

However, note that OPTICS does get rid of this parameter (at least when you have a proper implementation). When extracting clusters with the Xi method, you only need epsilon large enough to not cut structure you are interested in (and small enough to get the runtime you want - larger is slower, although not linearly). Xi then gives a relative increase in distance that is considered to be significant.
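For a Python starting point, here is a hedged sketch using scikit-learn's OPTICS class with Xi extraction. This is an assumption about tooling and not necessarily the implementation the answer has in mind; the blob data and the parameter values (min_samples=10, xi=0.05) are placeholders for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Synthetic placeholder data: three blobs in 20 dimensions.
X, _ = make_blobs(n_samples=300, centers=3, n_features=20,
                  cluster_std=1.0, random_state=0)

# max_eps=np.inf leaves the neighbourhood radius unbounded; a finite value
# trades completeness for runtime. xi is the relative steepness threshold.
optics = OPTICS(min_samples=10, max_eps=np.inf, cluster_method="xi", xi=0.05)
labels = optics.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "noise points:", int(np.sum(labels == -1)))

# Reachability distances in cluster order; plotting them gives the
# reachability plot, where valleys correspond to clusters.
reachability = optics.reachability_[optics.ordering_]
```

Inspecting the reachability plot before trusting the extracted labels is usually worthwhile, since it shows the density structure that the Xi step is cutting.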


answered Oct 17 '22 22:10

Has QUIT--Anony-Mousse
