i have a list of points which are the inertia values of a kmeans algorithm.
To determine the optimum amount of clusters i need to find the point, where this curve starts to flatten.
Data example
Here is how my list of values is created and filled:
sum_squared_dist = []
K = range(1,50)
for k in K:
km = KMeans(n_clusters=k, random_state=0)
km = km.fit(normalized_modeling_data)
sum_squared_dist.append(km.inertia_)
print(sum_squared_dist)
How can i find a point, where the pitch of this curve increases (the curve is falling, so the first derivation is negative)?
My approach
derivates = []
for i in range(len(sum_squared_dist)):
derivates.append(sum_squared_dist[i] - sum_squared_dist[i-1])
I want to find the optimum number of clusters any given data using the elbow method. Could someone help me how i can find the point where the list of the inertia values starts to flatten?
Edit
Datapoints:
[7342.1301373073857, 6881.7109460930769, 6531.1657905495022,
6356.2255554679778, 6209.8382535595829, 6094.9052166741121,
5980.0191582610196, 5880.1869867848218, 5779.8957906367368,
5691.1879324562778, 5617.5153566271356, 5532.2613232619951,
5467.352265375117, 5395.4493783888756, 5345.3459908298091,
5290.6769823693812, 5243.5271656371888, 5207.2501206569532,
5164.9617535255456]
Graph:
The elbow method runs k-means clustering on the dataset for a range of values for k (say from 1-10) and then for each value of k computes an average score for all clusters. By default, the distortion score is computed, the sum of square distances from each point to its assigned center.
Elbow MethodWCSS is the sum of squared distance between each point and the centroid in a cluster. When we plot the WCSS with the K value, the plot looks like an Elbow. As the number of clusters increases, the WCSS value will start to decrease. WCSS value is largest when K = 1.
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
I worked on a Python package modeled after the Kneedle algorithm. It finds x=5
as the point where the curve starts to flatten. The documentation and the paper discuss the algorithm for choosing the knee point in more detail.
y = [7342.1301373073857, 6881.7109460930769, 6531.1657905495022,
6356.2255554679778, 6209.8382535595829, 6094.9052166741121,
5980.0191582610196, 5880.1869867848218, 5779.8957906367368,
5691.1879324562778, 5617.5153566271356, 5532.2613232619951,
5467.352265375117, 5395.4493783888756, 5345.3459908298091,
5290.6769823693812, 5243.5271656371888, 5207.2501206569532,
5164.9617535255456]
x = range(1, len(y)+1)
from kneed import KneeLocator
kn = KneeLocator(x, y, curve='convex', direction='decreasing')
print(kn.knee)
5
import matplotlib.pyplot as plt
plt.xlabel('number of clusters k')
plt.ylabel('Sum of squared distances')
plt.plot(x, y, 'bx-')
plt.vlines(kn.knee, plt.ylim()[0], plt.ylim()[1], linestyles='dashed')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With