The problem I have is that there are rectangles within rectangles. Think of a map, except with the following traits with the key point being: rectangles with similar density often share similar dimensions and similar position on the x axis with other rectangles, but sometimes the distance between these rectangles may be big but usually small. If the position on x axis or dimensions are clearly way off, they would not be similar.
rectangles do not intersect, smaller rectangles are completely inside a larger rectangle.
rectangles often have similar x position and similar dimensions (similar height and width), and have smaller rectangles inside it. The rectangle itself would be considered a cluster of it's own.
Sometimes the distance of these cluster from another cluster may be quite big (think of islands). Often these clusters share the same or similar dimension and same or similar density of sub rectangles. if so, they should be considered as part of the same cluster despite a distance between the two clusters.
I've attached a diagram to describe the situation more clearly:
Red border means those groups are outliers, not part of any cluster and are ignored.
Blue border has many clusters (black borders containing black solid rectangles). They form a group of clusters that are similar due to the criteria mentioned above (similar width, similar X position, similar density). Even the clusters towards the bottom right corner is still considered part of this group because of the criteria (similar width, similar X position, similar density).
Turquois border has many clusters (black borders containing black solid rectangles). However, these clusters differ in dimension, x position, and density, from the ones in Blue border. They are considered a group of their own.
So far I found density clustering such as DBSCAN which seems to be perfect since it takes noise (outliers) into consideration, and you do not need to know ahead of time how many clusters there will be.
However, you need to define the minimum number of points needed to form a cluster and a threshold distance. What happens if you don't know these two and it can vary based on the problem described above?
Another seemingly plausible solution would be hierarchial (agglomerative) clustering (r-tree), but I'm concerned that I would still need to know the cutoff point in the tree depth level to determine if it's a cluster.
You certainly will need to take all your constraints into account.
In general, your task looks more like constraint satisfaction to me than clustering.
Maybe some constraint clustering approaches are useful to you, but I'm not sure if they allow your kind of constraints. Usually, they only support must-link and must-not-link constraints.
But of course you should try DBSCAN (in particular also: generalized DBSCAN, since the generalization might allow you to add the constraints you have!) and R-trees (which aren't actually a clustering algorithm, but a data index).
Note that R-trees will put the "outliers" into some leaf, to ensure minimum fill.
As is, I cannot give you more detailed recommendations, because even from above sketch, your constraints are not well defined IMHO. Try putting them into pseudocode. You probably only have a small number of rectangles (say, 100); so you can afford to run really expensive algorithms, such as linkage clustering with a customized linkage criterion. Putting your criterions into code may already be 99% of the effort!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With