Sklearn.KMeans() : Get class centroid labels and reference to a dataset

Sci-Kit learn Kmeans and PCA dimensionality reduction

I have a dataset, 2M rows by 7 columns, with different measurements of home power consumption with a date for each measurement.

date,
Global_active_power,
Global_reactive_power,
Voltage,
Global_intensity,
Sub_metering_1,
Sub_metering_2,
Sub_metering_3

I load my dataset into a pandas dataframe, select all columns except the date column, then perform a train/test split.

import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

data = pd.read_csv('household_power_consumption.txt', delimiter=';')
power_consumption = data.iloc[0:, 2:9].dropna()
pc_toarray = power_consumption.values
hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
power_consumption.head()

power table

I reduce the data to two dimensions with PCA, run K-means on the reduced data, and plot the result.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

hpc = PCA(n_components=2).fit_transform(hpc_fit)
k_means = KMeans()
k_means.fit(hpc)

x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
          extent=(xx.min(), xx.max(), yy.min(), yy.max()),
          cmap=plt.cm.Paired,
          aspect='auto', origin='lower')

plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='w', zorder=8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

PCA output

Now I would like to find out which rows fell under a given class, and then which dates those rows correspond to.

Is there any way to relate the points on the graph to an index in my dataset, after PCA?
Some method I don't know of?
Or is my approach fundamentally flawed?
Any recommendations?

I am fairly new to this field and am trying to read through lots of code. This is a compilation of several examples I've seen documented.

My goal is to classify the data and then get the dates that fall under a class.

Thank You


asked Dec 16 '14 12:12

flow

1 Answer

KMeans().predict(X)

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters: (New data to predict)

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Returns: (Index of the cluster each sample belongs to)  

labels : array, shape [n_samples,]
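To make the excerpt above concrete, here is a minimal sketch on made-up toy data (the points and the choice of two clusters are assumptions for illustration, not part of the question's dataset). It shows that each value returned by predict is a row index into cluster_centers_, and that labels_ holds the same kind of index for the rows used in fit, in input order.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [10.0, 10.0], [10.1, 9.9], [9.9, 10.2]])

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# predict() returns, for each row, the index of the nearest row in cluster_centers_.
new_points = np.array([[0.05, 0.05], [9.5, 10.5]])
predicted = k_means.predict(new_points)

# The same indices computed by hand: nearest centroid by Euclidean distance.
diffs = new_points[:, None, :] - k_means.cluster_centers_[None, :, :]
by_hand = np.linalg.norm(diffs, axis=2).argmin(axis=1)

# labels_ holds one such index per training row, in the order the rows were passed to fit().
train_labels = k_means.labels_
```

Because labels_ follows the input order, row i of the fitted array and element i of labels_ always refer to the same sample.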

The problem I see with the code you submitted is the use of

train_test_split()

which returns two arrays of randomly shuffled rows from your dataset. This destroys the original row order, which makes it difficult to match the labels returned by KMeans back to the sequential dates in your dataset.
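As a rough illustration of the shuffling issue, the sketch below uses a small made-up frame (the timestamps and Voltage values are hypothetical stand-ins for the real file). It assumes a recent scikit-learn where train_test_split lives in sklearn.model_selection. Splitting a bare array loses the link to the dates, while splitting the DataFrame itself keeps each row's index label, and shuffle=False keeps the original order.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 10 one-minute readings indexed by timestamp.
index = pd.date_range("2006-12-16 17:24", periods=10, freq="min")
df = pd.DataFrame({"Voltage": np.linspace(230.0, 239.0, 10)}, index=index)

# Splitting the bare array shuffles the rows and drops the timestamps entirely.
arr_train, arr_test = train_test_split(df.values, train_size=0.5, random_state=0)

# Splitting the DataFrame shuffles too, but every row carries its index label along.
df_train, df_test = train_test_split(df, train_size=0.5, random_state=0)

# shuffle=False keeps the original row order if a split is still wanted.
ordered_train, ordered_test = train_test_split(df, train_size=0.5, shuffle=False)
```

If a subsample is needed for speed, fitting on df_train.values and then assigning the labels back with index=df_train.index is one way to keep the dates attached.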

Here's an example:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

#read data into pandas dataframe
df = pd.read_csv('household_power_consumption.txt', delimiter=';')

Raw Dataset head

#merge the date and time columns and convert them to datetime objects
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
#inplace=True belongs to set_index(), not to the DatetimeIndex constructor
df.set_index(pd.DatetimeIndex(df['Datetime']), inplace=True)
df.drop(['Date','Time'], axis=1, inplace=True)

#put last column first
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
df = df.dropna()

preprocessed dates

#convert the dataframe to an array, leaving out the date column, which should not be clustered
sliced = df.iloc[0:, 1:8].dropna()
hpc = sliced.values

k_means = KMeans()
k_means.fit(hpc)

# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = k_means.labels_

#copy dataframe (may be memory intensive but just for illustration)
df_processed = df.copy()
df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)

Finished

The cluster label for each row now sits in the rightmost column, next to the data it was computed from.
With the rows labelled, it is up to you to derive meaning from the clusters.
This is just an overall example of how KMeans can be used, from start to finish.
To display the result, look at PCA, or make other graphs that depend on the class.
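One possible way to get from the labelled frame to the dates in a given class, sketched on synthetic data (df_demo, its column subset, and the two-cluster setting are assumptions for illustration; the real df_processed would take their place). It also projects the features to 2-D with PCA after clustering, so the labels still line up row for row with the projected points.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical stand-in for df_processed: timestamp index plus numeric columns.
rng = np.random.RandomState(0)
index = pd.date_range("2006-12-16 17:24", periods=200, freq="min")
values = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
                    rng.normal(8.0, 1.0, size=(100, 3))])
feature_cols = ["Voltage", "Global_intensity", "Sub_metering_1"]
df_demo = pd.DataFrame(values, index=index, columns=feature_cols)

# Cluster on the feature columns only; labels_ comes back in row order.
features = df_demo[feature_cols].values
k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
df_demo["Cluster Class"] = k_means.labels_

# Timestamps of every row assigned to class 0.
dates_in_class_0 = df_demo.index[df_demo["Cluster Class"] == 0]

# Number of rows per class.
rows_per_class = df_demo["Cluster Class"].value_counts()

# For display only: PCA preserves row order, so projected[i] belongs to label i.
projected = PCA(n_components=2).fit_transform(features)
```

The two columns of projected can then be passed to plt.scatter with c=df_demo["Cluster Class"] to colour the points by class.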

answered Oct 18 '22 06:10

flow


Tags:

python

date

k-means

svm

pca
