I have a dataset of 38 apartments and their electricity consumption in the morning, afternoon and evening. I am trying to cluster this dataset using the k-means implementation from scikit-learn, and am getting some interesting results.
First clustering results:
This is all very well, and with 4 clusters each apartment is obviously assigned one of 4 labels: 0, 1, 2 or 3. Using the random_state parameter of KMeans, I can fix the seed with which the centroids are randomly initialized, so I consistently get the same labels assigned to the same apartments.
However, since this case concerns energy consumption, the clusters can be ranked measurably from the lowest to the highest consumers. I would therefore like to assign label 0 to the apartments with the lowest consumption level, label 1 to apartments that consume a bit more, and so on.
As of now, my labels are [2 1 3 0], or ["black", "green", "blue", "red"]; I would like them to be [0 1 2 3] or ["red", "green", "black", "blue"]. How should I proceed to do so, while still keeping the centroid initialization random (with fixed seed)?
Thank you very much for the help!
Transforming the labels through a lookup table is a straightforward way to achieve what you want.
To begin with I generate some mock data:
import numpy as np
np.random.seed(1000)  # fixed seed so the mock data is reproducible
n = 38  # number of apartments
X_morning = np.random.uniform(low=.02, high=.18, size=n)
X_afternoon = np.random.uniform(low=.05, high=.20, size=n)
X_night = np.random.uniform(low=.025, high=.175, size=n)
X = np.vstack([X_morning, X_afternoon, X_night]).T  # shape (n, 3): one row per apartment
Then I perform clustering on the data:
from sklearn.cluster import KMeans
k = 4
kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
And finally I use NumPy's argsort to create a lookup table like this:
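idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))  # original labels, sorted by total consumption
lut = np.zeros_like(idx)
lut[idx] = np.arange(k)  # lut[original_label] = rank of that cluster (0 = lowest)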
In [70]: kmeans.cluster_centers_.sum(axis=1)
Out[70]: array([ 0.3214523 , 0.40877735, 0.26911353, 0.25234873])
In [71]: idx
Out[71]: array([3, 2, 0, 1], dtype=int64)
In [72]: lut
Out[72]: array([2, 3, 1, 0], dtype=int64)
In [73]: kmeans.labels_
Out[73]: array([1, 3, 1, ..., 0, 1, 0])
In [74]: lut[kmeans.labels_]
Out[74]: array([3, 0, 3, ..., 2, 3, 2], dtype=int64)
idx shows the cluster center labels ordered from lowest to highest consumption level. The apartments for which lut[kmeans.labels_] is 0 / 3 belong to the cluster with the lowest / highest consumption levels.
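For reuse, the steps above can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name relabel_by_total is made up for illustration, the data is freshly generated mock data (not the values from the session above), n_init=10 is set explicitly just to keep the behaviour the same across scikit-learn versions, and the ranking criterion is the sum of each centroid's coordinates, as in the lookup table above. The color list is the desired order from the question.

```python
import numpy as np
from sklearn.cluster import KMeans


def relabel_by_total(kmeans):
    """Hypothetical helper: renumber clusters so that label 0 is the
    cluster whose centroid has the smallest coordinate sum."""
    totals = kmeans.cluster_centers_.sum(axis=1)
    order = np.argsort(totals)            # order[rank] = original label
    lut = np.empty_like(order)
    lut[order] = np.arange(order.size)    # lut[original label] = rank
    # Return the new labels and the centroids in the same (ranked) order.
    return lut[kmeans.labels_], kmeans.cluster_centers_[order]


# Mock data (assumption): 38 apartments x 3 periods of the day.
rng = np.random.default_rng(0)
X = rng.uniform(low=0.02, high=0.20, size=(38, 3))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
new_labels, sorted_centers = relabel_by_total(kmeans)

# Map the ranked labels to plot colors: 0 -> red, ..., 3 -> blue.
colors = np.array(["red", "green", "black", "blue"])
point_colors = colors[new_labels]
```

Because only the label numbers are permuted after fitting, the centroid initialization stays random (with the fixed seed) and the cluster memberships are unchanged. Note that sorted_centers[j] is the centroid for new label j, so use it instead of kmeans.cluster_centers_ whenever you index by the new labels.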