Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

DBSCAN with R*-Tree - how it works

Whether someone can explain to me how dbscan algorithm works with R*-Tree? I understand work of dbscan, it seems, I understand as the R*-Tree works, but I can't connect them together.

Initially, I have data - feature vectors with 8 features, and I don't understand how I have to process them for construct R*-Tree. I will be grateful if someone lists the main steps which I have to pass.

I apologize if my question is obvious, but it causes difficulties in me. Thanks in advance!

like image 520
Vladislav Avatar asked Feb 08 '23 13:02

Vladislav


1 Answers

An R*-Tree indexes arbitrary geometric objects by their bounding box. In your case, as you have only points, the minimum and maximum values of your bounding box are the same. Every R*-Tree has a function like rtree.add_element(object, boundingbox). object would be the index of the data point and boundingbox would be as mentioned above.

The connecting point is the regionQuery part of DBSCAN. regionQuery(p) of a data point p returns all the data points q for which euclideanDistance(p,q) ≤ ε (value of parameter ε is provided by the user).

Naïvely, you could compute the distance of all your data points to p, which takes O(n) time for one data point, hence querying all your n data points takes o(n²) time. Alternatively you could precompute a matrix which holds the euclidean distances of all your data points to each other. This then takes O(n²) of space whereas regionQuery of one point just has to be looked up in that matrix.

An R*-Tree enables you to look up data points within coordinate ranges in O(log n) time. However, an R*-Tree only allows queries of the form

"All points where: Coordinate 1 in [ 0.3 ; 0.5 ] AND Coordinate 2 in [ 0.8 ; 1.0 ]"

and not

"All points q where: euclideanDistance(p,q) ≤ ε"

Therefore, you query the R*-Tree for points where each coordinate is the respective corrdinate of p±ε and then calculate the euclidean distance of all matching points to your query point p. The difference, however, is that these are far less points to check than if you would calculate the euclidean distance of p to all of your points. Therefore, your time complexity of one regionQuery is now O(log n * m), where m is the number of points returned by your R*-tree. If you choose ε small, you will get few matching points from your R*-tree and m will be small. So your time complexity approaches O(log n) for one regionQuery and therefore O(n * log n) for one regionQuery for each of your data poins. On the other extreme, if you choose ε so large that it will encompass most of your data points, m will approach n and therefore, the time needed for one regionQuery for each data point approaches O(n * log n * n) = O(n² * log n ) again, so you gain nothing compared to the naïve approach.

It is therefore of crucial importance that you choose ε small enough so that every point has only a few other points within euclidean distance of ε.

like image 157
akraf Avatar answered Feb 16 '23 07:02

akraf