Plot KMeans clusters and classification for 1-dimensional data

I am using KMeans to cluster three time-series datasets with different characteristics. For reproducibility, I am sharing the data here.

Here is my code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

protocols = {}

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }



# n_jobs, precompute_distances and algorithm='auto' are not accepted by
# recent scikit-learn releases, so they are omitted here.
k_means = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300,
    tol=0.0001, random_state=0, copy_x=True, verbose=0)
k_means.fit(quotient.reshape(-1,1))

Given a new data point (with quotient and quotient_times), I want to know which cluster it belongs to. My plan is to build each dataset by stacking the two transformed features, quotient and quotient_times, and clustering them with KMeans.

k_means.labels_ gives this output: array([1, 1, 0, 1, 2, 1, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int32)

Finally, I want to visualize the clusters using plt.plot(k_means, ".",color="blue") but I am getting this error: TypeError: float() argument must be a string or a number, not 'KMeans'. How do we plot KMeans clusters?


1 Answer

What you're effectively looking for is a range of values between which points are considered to be in a given class. It's quite unusual to use KMeans to classify 1d data in this way, although it certainly works. As you've noticed, you need to reshape your input data into a 2d array (a single column) before passing it to fit.

# n_jobs, precompute_distances and algorithm='auto' are not accepted by
# recent scikit-learn releases, so they are omitted here.
k_means = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300,
    tol=0.0001, random_state=0, copy_x=True, verbose=0)

quotient_2d = quotient.reshape(-1,1)
k_means.fit(quotient_2d)

You will need the quotient_2d again for the classification (prediction) step later.

First we can plot the centroids. Since the data is 1d, the x-axis position is arbitrary.

colors = ['r','g','b']
centroids = k_means.cluster_centers_
for n, y in enumerate(centroids):
    plt.plot(1, y, marker='x', color=colors[n], ms=10)
plt.title('Kmeans cluster centroids')

This produces the following plot.

[Figure: cluster centroids]
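Because the data is 1d, the centroids also fix the ranges mentioned at the start of this answer: a point is assigned to its nearest centroid, so the class boundaries are the midpoints between neighbouring centroids. The following is a minimal sketch of that idea, not part of the original answer. It uses synthetic values as a stand-in for `quotient`, and the names `boundaries` and `labels_from_ranges` are introduced here for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for `quotient` (assumption: the real data is not used here).
rng = np.random.default_rng(0)
quotient = np.concatenate([
    rng.normal(0.2, 0.02, 20),
    rng.normal(0.5, 0.02, 20),
    rng.normal(0.8, 0.02, 20),
])

k_means = KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(quotient.reshape(-1, 1))

# Cluster label numbers are arbitrary, so remember which label
# belongs to the smallest, middle and largest centroid.
centers = k_means.cluster_centers_.ravel()
order = np.argsort(centers)
sorted_centers = centers[order]

# In 1d the boundary between two neighbouring clusters is the midpoint
# between their centroids.
boundaries = (sorted_centers[:-1] + sorted_centers[1:]) / 2

# Classify by range: np.digitize returns the interval index of each value,
# and `order` maps that interval back to the KMeans label.
labels_from_ranges = order[np.digitize(quotient, boundaries)]

print("boundaries:", boundaries)
```

With three clusters this yields two boundary values, and the range-based labels should agree with what `.predict` returns for the same points.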

To get cluster membership for the points, pass quotient_2d to .predict. This returns an array of numbers for class membership, e.g.

>>> Z = k_means.predict(quotient_2d)
>>> Z
array([1, 1, 0, 1, 2, 1, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int32)
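The same call works for points that were not part of the fit, which is what the question asks for. Here is a rough sketch under the assumption that only the quotient value is used as a feature, as in the code above. The data and the `new_quotients` values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for `quotient` (assumption: not the asker's data).
rng = np.random.default_rng(0)
quotient = np.concatenate([rng.normal(m, 0.02, 20) for m in (0.2, 0.5, 0.8)])

k_means = KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(quotient.reshape(-1, 1))

# Hypothetical new observations; they need the same 2d shape as the training data.
new_quotients = np.array([0.21, 0.79])
new_labels = k_means.predict(new_quotients.reshape(-1, 1))

# Cross-check: predict assigns each point to its nearest centroid.
centers = k_means.cluster_centers_.ravel()
nearest = np.abs(new_quotients[:, None] - centers[None, :]).argmin(axis=1)

print(new_labels, nearest)
```

If quotient_times should also influence the assignment, the two features would have to be stacked into a two-column array before fitting, and probably scaled so that one does not dominate the distance.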

We can use this to filter our original data, plotting each class in a separate color.

# Plot each class as a separate colour
n_clusters = 3 
for n in range(n_clusters):
    # Filter data points to plot each in turn.
    ys = quotient[ Z==n ]
    xs = quotient_times[ Z==n ]

    plt.scatter(xs, ys, color=colors[n])

plt.title("Points by cluster")

This generates the following plot with the original data, each point coloured by the cluster membership.

[Figure: points coloured by cluster]
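The two plots can also be combined, drawing each centroid as a horizontal line across the time axis instead of at an arbitrary x position. This is only a sketch of one possible layout, using synthetic stand-ins for `quotient` and `quotient_times` rather than the data from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for the asker's arrays (assumption).
rng = np.random.default_rng(0)
quotient_times = np.sort(rng.uniform(0, 100, 60))
quotient = rng.choice([0.2, 0.5, 0.8], size=60) + rng.normal(0, 0.02, 60)

k_means = KMeans(n_clusters=3, n_init=10, random_state=0)
Z = k_means.fit_predict(quotient.reshape(-1, 1))

colors = ['r', 'g', 'b']
point_colors = [colors[z] for z in Z]  # one colour per point, by cluster label

fig, ax = plt.subplots()
ax.scatter(quotient_times, quotient, c=point_colors)

# One dashed horizontal line per centroid, in the matching cluster colour.
for n, y in enumerate(k_means.cluster_centers_.ravel()):
    ax.axhline(y, color=colors[n], linestyle='--')

ax.set_xlabel("quotient_times")
ax.set_ylabel("quotient")
ax.set_title("Points by cluster, with centroids")
```

Call `plt.show()` or `fig.savefig(...)` afterwards to display or store the figure.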

mfitzp answered Nov 29 '25 17:11