I have a 3-dimensional array of integers (~4000 x 6000 x 3) that I need to classify in a particular way. I'm hoping for something like a k-means clustering method, but instead of inputting the number of clusters I would like to input the maximum radius of a cluster.
Said another way, given a sphere of a defined size, I would like to find the minimum number of non-empty spheres that will cover all the data and classify the points accordingly.
Though I have little knowledge of the field, I have been looking into clustering algorithms for a little while and have not found one that accomplishes this.
For example, given a random dataset:
import numpy as np
randomArray = np.random.rand(5,5,3)*500
randomArray
Out[8]:
array([[[ 256.68932025, 153.07151992, 196.19477623],
[ 48.05542231, 346.1289173 , 327.44694932],
[ 427.87340594, 197.26882283, 402.41558648],
[ 192.50462233, 408.31800086, 81.66016443],
[ 64.15373494, 34.96971099, 446.55362458]],
[[ 376.55003794, 70.09514697, 242.08947306],
[ 194.86207829, 379.90969257, 439.47043484],
[ 102.99922513, 98.57769429, 415.5059223 ],
[ 464.65318671, 223.60061654, 417.52758666],
[ 53.68383153, 205.32517075, 299.83858164]],
[[ 364.80957874, 14.26150931, 264.01568428],
[ 295.75617954, 107.52678922, 87.89830525],
[ 57.90617865, 409.54132373, 54.36940482],
[ 217.35951975, 345.7892723 , 301.07031811],
[ 295.98999071, 27.17772152, 182.58776469]],
[[ 291.69513153, 466.03218019, 279.25794618],
[ 179.60152364, 161.64966386, 269.34221176],
[ 374.78609278, 259.18286321, 459.8037004 ],
[ 458.51249648, 87.05600447, 268.12588338],
[ 152.54500603, 472.36773898, 1.15894726]],
[[ 35.43731854, 163.42770568, 131.77683448],
[ 14.36039625, 390.33409364, 314.56443073],
[ 71.47211566, 109.78613335, 345.57021076],
[ 74.62340632, 328.51303903, 341.97344285],
[ 205.02121677, 235.71812371, 32.91252756]]])
and a maximum radius
R = 35
I would like to output a 2D array of the same height and width, with labels representing the sphere each point in the 3D array was assigned to. No two points with the same label should have a euclidean distance greater than the maximum radius, no sphere should be empty, and the algorithm should use the minimum number of spheres possible to accomplish this.
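To illustrate the bookkeeping I have in mind, roughly (some_clusterer is just a placeholder for the algorithm I'm looking for):
import numpy as np

data = np.random.rand(5, 5, 3) * 500    # stand-in for the real array
H, W, _ = data.shape
points = data.reshape(-1, 3)            # one 3D point per (row, col) position
# labels = some_clusterer(points, R)    # hypothetical: one integer label per point
# labelMap = labels.reshape(H, W)       # the 2D array of sphere labels I'm after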
You can use a variation of affinity propagation clustering in scikit-learn; basic guide, example, cluster object. You will have to modify the distance metric / matrix to ensure the cluster radius stays below your max radius. Affinity propagation and density-based (DBSCAN) clustering are two of the unsupervised clustering methods in scikit-learn that do not require a predefined number of clusters.
[figure: affinity propagation clustering example]
The reason I don't recommend DBSCAN is that some points will not be part of any cluster, whereas with affinity propagation every point belongs to a cluster. Affinity propagation can take a long time to converge, O(N^2 * T), where N is the number of samples and T is the number of iterations to converge. The similarity 'signal' that affinity propagation exchanges between points is where your radius constraint can enter. Since the default affinity, which governs how that signal is calculated between points, is a variation of euclidean distance, you have a jump start on calculating the clusters (see the sketch after the quoted docstring below).
affinity : string, optional, default='euclidean'
Which affinity to use. At the moment precomputed and euclidean are supported. euclidean uses the negative squared euclidean distance between points.
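For instance, a minimal sketch of running it on the flattened array; the damping and max_iter values here are just reasonable starting points, not tuned for this data:
import numpy as np
from sklearn.cluster import AffinityPropagation

points = (np.random.rand(5, 5, 3) * 500).reshape(-1, 3)

# Default affinity: negative squared euclidean distance between points.
ap = AffinityPropagation(damping=0.9, max_iter=1000)
labels = ap.fit_predict(points)          # one cluster label per point
print(len(ap.cluster_centers_), "clusters found")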
Additionally, you can weight different points using the preference input. As you iterate toward the right number of clusters, you will have to look at the affinity_matrix_ and cluster_centers_ attributes to determine whether all the points are within your max radius (a sketch of that loop follows the quoted docstring below). The preference for each point can change as you iterate through the different runs of clustering to indicate which cluster the point is most likely to be a part of.
preference : array-like, shape (n_samples,) or float, optional
Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, ie of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.
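As a rough sketch of that iteration (the starting preference and step size are arbitrary guesses you would tune for your data):
import numpy as np
from sklearn.cluster import AffinityPropagation

R = 35
points = (np.random.rand(5, 5, 3) * 500).reshape(-1, 3)

# Lower (more negative) preferences yield fewer exemplars, so start low and
# raise the preference until every point lies within R of its exemplar.
preference = -50000.0                            # arbitrary starting guess
for _ in range(20):
    ap = AffinityPropagation(preference=preference, damping=0.9,
                             max_iter=1000).fit(points)
    if len(ap.cluster_centers_) == 0:            # run failed to converge
        preference += 5000.0
        continue
    exemplars = ap.cluster_centers_[ap.labels_]  # each point's exemplar
    radii = np.linalg.norm(points - exemplars, axis=1)
    if radii.max() <= R:
        break                                    # all points within max radius
    preference += 5000.0                         # more exemplars, smaller clusters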
If you want to try DBSCAN, the three main parameters are eps (the maximum distance between two samples), min_samples (the minimum number of points needed to form a cluster), and the distance metric; DBSCAN cluster object, basic guide, example.
[figure: DBSCAN clustering example]
You may not have to iterate as much with DBSCAN as you would with affinity propagation. You can quickly test whether the max distance you chose is viable, since DBSCAN assigns the cluster label -1 to any sample that is not part of a cluster at that distance. Be aware that memory consumption will be high with large datasets.
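A minimal sketch of that quick test, assuming you set eps to your max radius:
import numpy as np
from sklearn.cluster import DBSCAN

R = 35
points = (np.random.rand(5, 5, 3) * 500).reshape(-1, 3)

# Any sample labelled -1 fell outside every cluster at this distance.
db = DBSCAN(eps=R, min_samples=2).fit(points)
n_noise = np.sum(db.labels_ == -1)
print("unclustered points at eps =", R, ":", n_noise)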
There are three types of points / samples in the DBSCAN algorithm:
- Core samples: a sample in the dataset such that there exist at least min_samples other samples within a distance of eps.
- Non-core samples: samples that are neighbors of a core sample in the cluster but are not themselves core samples; in other words, the sample is within eps distance of a core sample but does not itself have min_samples neighbors within eps.
- Outliers: a sample that is not a core sample and is at least eps in distance from any core sample.
The DBSCAN clusters are defined by the core samples only. This can get confusing, but scikit-learn does a pretty good job of explaining the algorithm; the sketch below separates the three categories on a fitted model.
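A small sketch of doing that (eps and min_samples are placeholder values):
import numpy as np
from sklearn.cluster import DBSCAN

points = (np.random.rand(5, 5, 3) * 500).reshape(-1, 3)
db = DBSCAN(eps=35, min_samples=2).fit(points)

core = np.zeros(len(points), dtype=bool)
core[db.core_sample_indices_] = True           # core samples
border = (db.labels_ != -1) & ~core            # in a cluster, but not core
noise = db.labels_ == -1                       # outliers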
I don't think there is an easy off-the-shelf implementation for your problem. Both affinity propagation and DBSCAN have their pros and cons. I think affinity propagation would be your best bet initially, because DBSCAN can get confusing and may give you a memory error depending on the size of your sample matrix.
I hope this helps. It's a very interesting question; I would try a variation of affinity propagation first.