Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer to this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc package in R. For example, I am looking at the USArrests data set and running dbscan on it as follows:

library(fpc)
ds <- dbscan(USArrests, eps = 20)

Choosing eps was purely by trial and error in this case. However, I am wondering whether there is a function or code available to automate the choice of the best eps/minpts. I know some books recommend producing a plot of the sorted distances from each point to its kth nearest neighbour. That is, the x-axis represents "Points sorted according to distance to kth nearest neighbour" and the y-axis represents the "kth nearest neighbour distance".

This type of plot is useful for choosing an appropriate value for eps and minpts. I hope I have provided enough information for someone to be able to help me out. I wanted to post a picture of what I mean, but I'm still a newbie and can't post images yet.

asked Oct 15 '12 10:10

Belinda Chiera

1 Answer

There is no general way of choosing minPts. It depends on what you want to find. A low minPts means it will build more clusters from noise, so don't choose it too small.
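As a rough illustration (base R only, not fpc's own code), you can count how many points would qualify as core points at a fixed eps for several minPts values. The eps of 20 comes from the question; the minPts grid and the variable names are arbitrary assumptions for this sketch.

```r
# Sketch: how the number of DBSCAN core points changes with minPts.
# Assumptions: unscaled USArrests, Euclidean distance, eps = 20 (from the question).
d <- as.matrix(dist(USArrests))        # pairwise Euclidean distances
eps <- 20

# Size of each point's eps-neighbourhood. The point itself is counted
# (its self-distance is 0), as in the usual DBSCAN definition.
neighbours <- rowSums(d <= eps)

min_pts_grid <- c(2, 3, 5, 10)         # hypothetical values to compare
core_counts <- sapply(min_pts_grid, function(m) sum(neighbours >= m))
names(core_counts) <- paste0("minPts=", min_pts_grid)
print(core_counts)
```

The counts can only stay the same or shrink as minPts grows: with a small minPts even sparse regions contain core points, which is how noise ends up forming clusters.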

For epsilon, there are various aspects. It again boils down to choosing whatever works on this data set, with this minPts, this distance function and this normalization. You can try a k-NN distance plot (the sorted distances to the kth nearest neighbour) and choose a "knee" there, but there might be no visible knee, or several.
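A minimal sketch of such a k-NN distance plot in base R follows. It assumes Euclidean distance on the unscaled USArrests data and k = 5; both are choices made only for illustration, and picking the knee is still done by eye.

```r
# Sketch: sorted k-nearest-neighbour distance plot for choosing eps.
# Assumptions: Euclidean distance, unscaled USArrests, k = 5.
k <- 5
d <- as.matrix(dist(USArrests))        # full pairwise distance matrix

# For each point, sort its row of distances. Element 1 is the point itself
# (distance 0), so element k + 1 is the distance to its kth nearest neighbour.
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])
sorted_knn <- sort(knn_dist)

plot(sorted_knn, type = "l",
     xlab = "Points sorted according to distance to kth nearest neighbour",
     ylab = "kth nearest neighbour distance")
```

A candidate eps is the distance at which the curve bends sharply upwards. If the dbscan package is installed, its kNNdistplot() function draws the same kind of plot.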

OPTICS is a successor to DBSCAN that does not need the epsilon parameter (except for performance reasons with index support, see Wikipedia). It's much nicer, but I believe it is a pain to implement in R, because it needs advanced data structures (ideally, a data index tree for acceleration and an updatable heap for the priority queue), and R is all about matrix operations.

Naively, one can imagine OPTICS as running DBSCAN for all values of epsilon at the same time and putting the results in a cluster hierarchy.

The first thing you need to do, however, pretty much regardless of which clustering algorithm you are going to use, is make sure you have a useful distance function and appropriate data normalization. If your distance degenerates, no clustering algorithm will work.
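To see why normalization matters for the data in the question, here is a small base R sketch. It assumes z-score standardization via scale(), which is only one of several possible normalizations and not necessarily the right one for your problem.

```r
# Sketch: check column scales before clustering.
# Assumption: z-score standardization with scale() is an acceptable normalization.
raw_sd <- sapply(USArrests, sd)
print(round(raw_sd, 2))                # Assault has by far the largest spread

# Centre each column to mean 0 and rescale it to standard deviation 1
scaled <- scale(USArrests)
scaled_sd <- apply(scaled, 2, sd)
print(round(scaled_sd, 2))

# Distances before and after scaling live on very different ranges,
# so an eps tuned on the raw data (such as 20) does not carry over.
print(summary(as.vector(dist(USArrests))))
print(summary(as.vector(dist(scaled))))
```

On the raw data the Euclidean distance is driven mostly by the Assault column, so any eps you pick is effectively an eps on assault rates. After scaling, eps has to be chosen again from scratch.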

answered Sep 21 '22 04:09

Has QUIT--Anony-Mousse

Tags:

r

cluster-analysis

data-mining

dbscan