My data is like this:
powerplantname, latitude, longitude, powergenerated
A, -92.3232, 100.99, 50
B, <lat>, <long>, 10
C, <lat>, <long>, 20
D, <lat>, <long>, 40
E, <lat>, <long>, 5
I want to be able to cluster the data into N clusters (say 3). Normally I would use k-means:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten
coordinates= np.array([
[lat, long],
[lat, long],
...
[lat, long]
])
x, y = kmeans2(whiten(coordinates), 3, iter = 20)
plt.scatter(coordinates[:,0], coordinates[:,1], c=y);
plt.show()
The problem with this is that it does not account for any weighting (in this case, my powergenerated value). I want my clusters to take the value "powergenerated" into account, keeping the clusters not only spatially close, but also relatively equal in total powergenerated.
Should I be doing this with kmeans (or some other method)? Or is there something else I should be using for this problem that would be better?
Or is there something else I should be using for this problem that would be better?
To take into account both the geographical distance between plants and the generated power, you should define a proper metric. The function below computes the distance between two points on the Earth's surface from their latitudes and longitudes through the haversine formula, and adds the absolute value of the generated power difference multiplied by a weighting factor. The value of the weight determines the relative influence of distance and power difference in the clustering process.
import numpy as np

def custom_metric(central_1, central_2, weight=1):
    # Each feature vector is (latitude, longitude, power).
    lat1, lng1, pow1 = central_1
    lat2, lng2, pow2 = central_2
    # Haversine formula: great-circle distance on the Earth's surface.
    lat1, lat2, lng1, lng2 = np.deg2rad(np.asarray([lat1, lat2, lng1, lng2]))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    h = (1 - np.cos(dlat))/2. + np.cos(lat1)*np.cos(lat2)*(1 - np.cos(dlng))/2.
    km = 2*6371*np.arcsin(np.sqrt(h))  # Earth radius ~6371 km
    # Absolute difference of generated power, scaled by the weighting factor.
    MW = np.abs(pow2 - pow1)
    return km + weight*MW
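As a quick sanity check, you can call the metric directly on two feature vectors; the two plants below, with their coordinates and power values, are purely illustrative:

# Two hypothetical plants given as (latitude, longitude, power in MW).
plant_a = (10.0, 20.0, 50)
plant_b = (12.0, 22.0, 10)
print(custom_metric(plant_a, plant_b))            # great-circle distance in km plus 1 * |50 - 10|
print(custom_metric(plant_a, plant_b, weight=5))  # same distance, but the power difference counts 5x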
Should I be doing this with kmeans (or some other method)?
Unfortunately the current implementations of SciPy's kmeans2 and scikit-learn's KMeans only support Euclidean distance. An alternative would be to perform hierarchical clustering through SciPy's clustering package to group the plants according to the metric just defined.
Let us first generate mock data, namely feature vectors for 8 plants with random values:
N = 8
np.random.seed(0)  # reproducible mock data
lat = np.random.uniform(low=-90, high=90, size=N)
lng = np.random.uniform(low=-180, high=180, size=N)
power = np.random.randint(low=5, high=50, size=N)
data = np.vstack([lat, lng, power]).T  # one (lat, lng, power) row per plant
The content of the variable data yielded by the snippet above looks like this:
array([[ 8.7864, 166.9186, 21. ],
[ 38.7341, -41.9611, 10. ],
[ 18.4974, 105.021 , 20. ],
[ 8.079 , 10.4022, 5. ],
[ -13.7421, 24.496 , 23. ],
[ 26.2609, 153.2148, 40. ],
[ -11.2343, -154.427 , 29. ],
[ 70.5191, -148.6335, 34. ]])
To divide those data into three different groups we have to pass data and custom_metric to the linkage function (check the docs to find out more on the parameter method), and then pass the returned linkage matrix to the cut_tree function with n_clusters=3.
from scipy.cluster.hierarchy import linkage, cut_tree

Z = linkage(data, method='average', metric=custom_metric)  # linkage matrix built with the custom metric
y = cut_tree(Z, 3).flatten()                               # one cluster label per plant
As a result we get the group membership (array y) for each plant:
array([0, 1, 0, 2, 2, 0, 0, 1])
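Since part of your original goal is keeping the total powergenerated per cluster roughly balanced, you may also want to inspect the totals. A minimal sketch, reusing the data and y arrays defined above:

# Sum the power column (index 2) for every cluster label to see how balanced the groups are.
for cluster in np.unique(y):
    total_power = data[y == cluster, 2].sum()
    print(f"cluster {cluster}: total power = {total_power}")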
The results above depend on the value of weight. If you wish to use a value different from 1 (for example 250), you can change the default value like this:
def custom_metric(central_1, central_2, weight=250):
Alternatively, you could set the parameter metric in the call to linkage to a lambda expression as follows: metric=lambda x, y: custom_metric(x, y, 250).
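For example, the full call with that lambda, reusing the data array from above and an assumed weight of 250, would be:

# Same clustering as before, but with the power difference weighted 250 times via a lambda.
Z = linkage(data, method='average', metric=lambda x, y: custom_metric(x, y, 250))
y = cut_tree(Z, 3).flatten()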
Finally, to gain a deeper insight into the hierarchical/agglomerative clustering you could plot it as a dendrogram:
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt
dendrogram(Z)
plt.show()