Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

find the "elbow point" on an optimization curve with Python

i have a list of points which are the inertia values of a kmeans algorithm.
To determine the optimum amount of clusters i need to find the point, where this curve starts to flatten.

Data example

Here is how my list of values is created and filled:

sum_squared_dist = []
K = range(1,50)
for k in K:
    km = KMeans(n_clusters=k, random_state=0)
    km = km.fit(normalized_modeling_data)
    sum_squared_dist.append(km.inertia_)

print(sum_squared_dist)

How can i find a point, where the pitch of this curve increases (the curve is falling, so the first derivation is negative)?

My approach

derivates = []
for i in range(len(sum_squared_dist)):
    derivates.append(sum_squared_dist[i] - sum_squared_dist[i-1])

I want to find the optimum number of clusters any given data using the elbow method. Could someone help me how i can find the point where the list of the inertia values starts to flatten?

Edit
Datapoints:

[7342.1301373073857, 6881.7109460930769, 6531.1657905495022,  
6356.2255554679778, 6209.8382535595829, 6094.9052166741121, 
5980.0191582610196, 5880.1869867848218, 5779.8957906367368, 
5691.1879324562778, 5617.5153566271356, 5532.2613232619951, 
5467.352265375117, 5395.4493783888756, 5345.3459908298091, 
5290.6769823693812, 5243.5271656371888, 5207.2501206569532, 
5164.9617535255456]

Graph: graph

like image 956
ItFreak Avatar asked Aug 09 '18 08:08

ItFreak


People also ask

What is elbow method in Python?

The elbow method runs k-means clustering on the dataset for a range of values for k (say from 1-10) and then for each value of k computes an average score for all clusters. By default, the distortion score is computed, the sum of square distances from each point to its assigned center.

How do you find the elbow method?

Elbow MethodWCSS is the sum of squared distance between each point and the centroid in a cluster. When we plot the WCSS with the K value, the plot looks like an Elbow. As the number of clusters increases, the WCSS value will start to decrease. WCSS value is largest when K = 1.

What is the elbow curve?

In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.


1 Answers

I worked on a Python package modeled after the Kneedle algorithm. It finds x=5 as the point where the curve starts to flatten. The documentation and the paper discuss the algorithm for choosing the knee point in more detail.

y = [7342.1301373073857, 6881.7109460930769, 6531.1657905495022,  
6356.2255554679778, 6209.8382535595829, 6094.9052166741121, 
5980.0191582610196, 5880.1869867848218, 5779.8957906367368, 
5691.1879324562778, 5617.5153566271356, 5532.2613232619951, 
5467.352265375117, 5395.4493783888756, 5345.3459908298091, 
5290.6769823693812, 5243.5271656371888, 5207.2501206569532, 
5164.9617535255456]

x = range(1, len(y)+1)

from kneed import KneeLocator
kn = KneeLocator(x, y, curve='convex', direction='decreasing')
print(kn.knee)
5

import matplotlib.pyplot as plt
plt.xlabel('number of clusters k')
plt.ylabel('Sum of squared distances')
plt.plot(x, y, 'bx-')
plt.vlines(kn.knee, plt.ylim()[0], plt.ylim()[1], linestyles='dashed')

knee point

like image 102
Kevin Avatar answered Oct 26 '22 16:10

Kevin