I am interested in detecting clusters in areas with varying density, such as user-generated data in cities, and for that I adopted the OPTICS algorithm.
Unlike DBSCAN, the OPTICS algorithm does not produce a strict cluster partition, but an augmented ordering of the database. To produce the cluster partition, I use OPTICSxi, which is another algorithm that produces a classification based on the output of OPTICS. There are few libraries capable of extracting a cluster partition from the output of OPTICS, and ELKI’s OPTICSxi implementation is one of them.
It is very clear to me how to interpret the results of DBSCAN (although it is not that easy to set “meaningful” global parameters): DBSCAN detects a “prototype” of a cluster, characterized by a density, expressed as a number of points per area (minpts/epsilon). The results of OPTICSxi seem a bit more difficult to interpret.
There are two phenomena that I sometimes see in the outputs of OPTICSxi and that I am not able to explain. One is the appearance of “spike” clusters that link parts of the map. I cannot explain them because they seem to be made of very few points, and I don’t understand how the algorithm decides to group them in the same cluster. Do they really represent a “corridor” of density variation? Looking at the underlying data, it does not look like that. You can see these “spikes” in the image below.
The other phenomenon that I cannot explain is that sometimes there are "overlapping" clusters at the same hierarchical level. OPTICSxi is based on the OPTICS ordering of the database (e.g. shown as a dendrogram), and there are no repeated points in that diagram.
Since this is a hierarchical clustering, we consider that clusters of a lower level contain clusters of a higher level, and that idea is enforced when building the convex hulls. However, I don’t see any justification for having clusters that intersect other clusters on the same hierarchical level, which in practice would mean that some points have a double cluster “membership”. In the image below, we can see some intersecting clusters with the same hierarchical level (0).
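One way to check whether the intersecting level-0 clusters really share points, or whether only their drawn convex hulls cross, is to compare the member sets directly. Below is a minimal Python sketch, assuming the cluster assignments have been exported as a mapping from a cluster name to a collection of point IDs. The function name `shared_points`, the cluster names and the IDs are all hypothetical, and none of this is part of ELKI.

```python
from itertools import combinations

def shared_points(clusters):
    """Return {(name_a, name_b): shared ids} for every pair of clusters sharing points."""
    overlaps = {}
    for (a, ids_a), (b, ids_b) in combinations(clusters.items(), 2):
        common = set(ids_a) & set(ids_b)
        if common:
            overlaps[(a, b)] = common
    return overlaps

# Hypothetical export: three clusters that are all reported at level 0.
clusters = {
    "c0": {1, 2, 3, 4},
    "c1": {5, 6, 7},
    "c2": {4, 8, 9},   # point 4 also appears in c0
}
overlaps = shared_points(clusters)
print(overlaps)
```

If the result is empty for all clusters at the same level, no point has a double membership, and any intersection seen on the map comes from the hull geometry alone, since convex hulls of disjoint point sets can still cross.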
Finally, the most important thought/question that I want to leave you with is: what do we expect to see in an OPTICSxi clustering classification? This question is closely linked to the task of parametrizing OPTICSxi.
Since I see hardly any studies with runs of OPTICSxi for a particular clustering problem, I struggle to find what an optimal clustering classification would be, i.e. one that can provide some meaningful/useful results and add some value to the DBSCAN clustering. To help me answer that question, I performed many runs of OPTICSxi with different combinations of parameters, and I selected three that I will discuss below.
In this run I used a large value of epsilon (2 km); the meaning of that value is that we accept large clusters (up to 2 km); since the algorithm “merges” clusters, we end up with some very large clusters that will almost certainly have a low density. I like this output because it exposes the hierarchical structure of the classification, and it actually reminds me of several runs of DBSCAN with different combinations of parameters (for different densities), which is the advertised “strength” of OPTICS. As mentioned before, smaller clusters correspond to higher levels in the hierarchy and to higher densities.
In this run we see a large number of clusters, even though the “contrast” parameter is the same as in the previous run. That is mostly because I chose a low number of minpts, which means that we accept clusters with a low number of points. Since the epsilon in this case is smaller, we don’t see those large clusters occupying a large part of the map. I find this output less interesting than the previous one, mostly because, even though we have a hierarchical structure, there are many clusters at the same level, and many of them intersect. In terms of interpretation, I can see an overall “shape” that is similar to the previous one, but it is actually discretized into lots of small clusters that are easily overlooked as “noise”.
This run has a parameter choice that is similar to the previous one, except that minpts is larger; the consequence is that not only do we find fewer clusters that overlap less, but they are also mostly at the same level.
From the perspective of adding value to DBSCAN, I would opt for the first combination of parameters, since it provides a hierarchical picture of the data, clearly exposing which areas are denser. IMHO the last combination of parameters fails to provide an idea of the global distribution of density, since it finds similar clusters all over the study area. I am interested to read other opinions.
The problem with extracting clusters from the OPTICS plot is the first and last elements of a cluster. Just from the plot, you cannot (to my understanding) decide whether the last element should belong to the previous cluster or not.
Consider a plot like this:
*
*        *
*        *
*       **
**************
A B C D EF G H
This can be a cluster where A is right in the middle, B-E nearby, and F is the nearest element in a completely different cluster. For example, the data set might look like this:
* D *
B A E F G
* C H *
Or, A is at the rim of the first cluster, B-D are part of the cluster, whereas E is an outlier element bridging the gap to the cluster F-H. A data set that causes such an effect could look like this:
D * *
* C B A E F G
E * H *
OpticsXi operates visually, on the plot alone. F is the "steeper" point to split at, so E will in each case be part of the first cluster. It is literally the best guess OpticsXi can make without looking at the data points.
This is likely the effect causing the spikes you have been observing.
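To make the "operates visually" point concrete, here is a minimal Python sketch of the xi-steep test from the OPTICS paper, applied to a reachability sequence shaped like the plot above. The reachability values, the choice xi = 0.3 and the function names `steep_down` and `steep_up` are my assumptions for illustration; this is not ELKI's implementation, which does more than this single test.

```python
import math

def steep_down(reach, xi):
    """Indices i where reachability drops by at least a factor (1 - xi) from i to i + 1."""
    return [i for i in range(len(reach) - 1)
            if reach[i] * (1 - xi) >= reach[i + 1]]

def steep_up(reach, xi):
    """Indices i where reachability rises by at least a factor (1 - xi) from i to i + 1."""
    return [i for i in range(len(reach) - 1)
            if reach[i] <= reach[i + 1] * (1 - xi)]

labels = "ABCDEFGH"
# Hypothetical reachability values; A comes first, so its reachability is undefined (infinite).
reach = [math.inf, 1.0, 1.0, 1.0, 2.0, 4.0, 1.0, 1.0]
xi = 0.3

down = [labels[i] for i in steep_down(reach, xi)]
up = [labels[i] for i in steep_up(reach, xi)]
print("steep down at:", down)
print("steep up at:", up)
```

With these values the steep-up area runs from D to E and the peak sits at F, so an extraction that only looks at the plot would close the first cluster at E. It makes the same decision for both data sets sketched above, whether E is a genuine member or a lone point bridging the gap.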
I see four options:
1. Improve OpticsXi yourself. If you are interested, we can discuss some possible heuristics to distinguish the two cases above.
2. Implement one of the other extraction methods, such as inflexion points (but they may suffer from the same effects, as these are in the plot itself, AFAICT).
3. Use HDBSCAN (sorry, not yet included in ELKI, although we have a version that appears to be working), probably in 0.7.0.
4. Apply post-processing to the clusters. In particular, test the first and last few points in cluster order to decide whether to keep them in the cluster, move them to the next cluster, or move them to the parent cluster. Maybe simply by average distance from the cluster...
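The post-processing idea could be sketched roughly as follows. This is only an illustration in Python, not ELKI code: the function names, the number of tested end points `k` and the threshold `factor` are assumptions, and the rule used (average distance to the middle of the cluster compared with the typical average distance inside that middle) is just one simple heuristic.

```python
import math
from statistics import mean

def avg_dist(p, others):
    """Average Euclidean distance from point p to every point in others."""
    return mean(math.dist(p, q) for q in others)

def trim_cluster_ends(cluster, k=2, factor=2.0):
    """Split a cluster (points in cluster order) into kept and rejected points.

    Only the first k and last k points are tested. A tested point is rejected
    when its average distance to the untested middle points exceeds `factor`
    times the typical average distance among those middle points.
    """
    if len(cluster) <= 2 * k + 1:
        return list(cluster), []          # too small to judge, keep everything
    middle = cluster[k:-k]
    typical = mean(avg_dist(p, middle[:j] + middle[j + 1:])
                   for j, p in enumerate(middle))
    kept, rejected = [], []
    for i, p in enumerate(cluster):
        is_end = i < k or i >= len(cluster) - k
        if is_end and avg_dist(p, middle) > factor * typical:
            rejected.append(p)
        else:
            kept.append(p)
    return kept, rejected

# Hypothetical cluster in cluster order; the last point bridges towards another cluster.
cluster = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (1, 0.5), (5, 0.5)]
kept, rejected = trim_cluster_ends(cluster, k=1)
print("kept:", kept)
print("rejected:", rejected)
```

The rejected points are returned rather than reassigned, so the caller can decide whether to move them to the next cluster, to the parent cluster, or to noise.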