
PCA output looks weird for a kmeans scatter plot

After doing PCA on my data and plotting the kmeans clusters, my plot looks really weird. The centers of the clusters and scatter plot of the points do not make sense to me. Here is my code:

#clicks, conversion, bounce and search are lists of values.
clicks=[2,0,0,8,7,...]
conversion = [1,0,0,6,0...]
bounce = [2,4,5,0,1....]

X = np.array([clicks,conversion, bounce]).T
y = np.array(search)

num_clusters = 5

pca=PCA(n_components=2, whiten=True)
data2D = pca.fit_transform(X)

print data2D
    >>> [[-0.07187948 -0.17784291]
     [-0.07173769 -0.26868727]
     [-0.07173789 -0.26867958]
     ..., 
     [-0.06942414 -0.25040886]
     [-0.06950897 -0.19591147]
     [-0.07172973 -0.2687937 ]]

km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(X)

labels=km.labels_
centers2D = pca.fit_transform(km.cluster_centers_)

colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]

plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1],  marker='x', c='r')
plt.show()

The red crosses are the centers of the clusters. Any help would be great.

asked Jul 01 '15 by jxn


2 Answers

Your ordering of PCA and KMeans is screwing things up...

Here is what you need to do:

  1. Normalize your data.
  2. Perform PCA on X to reduce the dimensions from 5 to 2 and produce Data2D
  3. Normalize again
  4. Cluster Data2D with KMeans
  5. Plot the Centroids on top of Data2D.

Whereas, here is what you have done above:

  1. Perform PCA on X to reduce the dimensions from 5 to 2 to produce Data2D
  2. Cluster the original data, X, in 5 dimensions.
  3. Perform a separate PCA on your cluster centroids, which produces a completely different 2D subspace for the centroids.
  4. Plot the PCA-reduced Data2D with the PCA-reduced centroids on top, even though the two no longer live in the same 2D subspace (see the minimal fix sketched right after this list).
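
If all you wanted was to fix that subspace mismatch in your original code (leaving the normalization and ordering advice aside for a moment), a minimal sketch would reuse the PCA that was already fit on X to project the centroids, instead of fitting a second PCA on them. This reuses the variable names from your question:

pca = PCA(n_components=2, whiten=True)
data2D = pca.fit_transform(X)        # fit once, on the data

km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10)
km.fit(X)                            # clustering still done in the original space here

# transform, NOT fit_transform: this keeps the centroids in the same 2D subspace as data2D
centers2D = pca.transform(km.cluster_centers_)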

Normalization:

Take a look at the code below and you'll see that it puts the centroids right where they need to be. Normalization is the key, and it is completely reversible. ALWAYS normalize your data when you cluster, because the distance metric needs to treat every dimension equally. Clustering is one of the most important times to normalize your data, but in general... ALWAYS NORMALIZE :-)
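
(If you prefer not to hand-roll the rescaling, scikit-learn's StandardScaler does a similar job; this is just a sketch and is not part of the code below:)

from sklearn.preprocessing import StandardScaler

# z-score each feature so no single column dominates the Euclidean distances
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # X is the raw feature matrix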

A heuristic discussion that goes beyond your original question:

The entire point of dimensionality reduction is to make the KMeans clustering easier and to project out dimensions which don't add to the variance of the data, so you should pass the reduced data to your clustering algorithm. I'll add that there are very few 5D datasets which can be projected down to 2D without throwing out a lot of variance, so check the PCA diagnostics to see whether roughly 90% of the original variance has been preserved (a quick check is sketched below). If not, then you might not want to be so aggressive in your PCA.
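
A minimal sketch of that check, assuming numpy and a fitted PCA as in the code below:

# cumulative fraction of the original variance kept by the retained components
retained = np.cumsum(pca.explained_variance_ratio_)[-1]
print "Variance retained by 2 components: ", retained
if retained < 0.90:
    print "Consider keeping more components before clustering"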

New Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import seaborn as sns
%matplotlib inline

# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('/Users/angus/Desktop/Downloads/stackoverflow.csv',
                 usecols=[0, 2, 4],
                 names=['freq', 'visit_length', 'conversion_cnt'],
                 header=0).dropna()

df.describe()

#Normalize the data
df_norm = (df - df.mean()) / (df.max() - df.min())

num_clusters = 5

pca=PCA(n_components=2)
UnNormdata2D = pca.fit_transform(df_norm)

# Check the resulting variance
var = pca.explained_variance_ratio_
print "Variance after PCA: ", var

#Normalize again following PCA: data2D
data2D = (UnNormdata2D - UnNormdata2D.mean()) / (UnNormdata2D.max()-UnNormdata2D.min())

print "Data2D: "
print data2D

km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(data2D)

labels=km.labels_
centers2D = km.cluster_centers_

colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]

plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
# a second scatter on the same axes overlays the centroids (plt.hold is deprecated/removed in newer matplotlib)
plt.scatter(centers2D[:,0], centers2D[:,1],marker='x',s=150.0,color='purple')
plt.show()

Plot:

[plot from the code above]

Output:

Variance after PCA:  [ 0.65725709  0.29875307]
Data2D: 
[[-0.00338421 -0.0009403 ]
[-0.00512081 -0.00095038]
[-0.00512081 -0.00095038]
..., 
[-0.00477349 -0.00094836]
[-0.00373153 -0.00094232]
[-0.00512081 -0.00095038]]
Initialization complete
Iteration  0, inertia 51.225
Iteration  1, inertia 38.597
Iteration  2, inertia 36.837
...
...
Converged at iteration 31

Hope this helps!

answered by AN6U5


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()

Out[3]: 
              freq  visit_length  conversion_cnt
count  289705.0000   289705.0000     289705.0000
mean        0.2624       20.7598          0.0748
std         0.4399       55.0571          0.2631
min         0.0000        1.0000          0.0000
25%         0.0000        6.0000          0.0000
50%         0.0000       10.0000          0.0000
75%         1.0000       21.0000          0.0000
max         1.0000     2500.0000          1.0000

# binarize freq and conversion_cnt
df.freq = np.where(df.freq > 1.0, 1, 0)
df.conversion_cnt = np.where(df.conversion_cnt > 0.0, 1, 0)

feature_names = df.columns
X_raw = df.values

transformer = PCA(n_components=2)
X_2d = transformer.fit_transform(X_raw)
# over 99.9% variance captured by 2d data
transformer.explained_variance_ratio_

Out[4]: array([  9.9991e-01,   6.6411e-05])

# do clustering
estimator = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
estimator.fit(X_2d)

labels = estimator.labels_
colors = ['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]

fig, ax = plt.subplots()
ax.scatter(X_2d[:,0], X_2d[:,1], c=label_color)
ax.scatter(estimator.cluster_centers_[:,0], estimator.cluster_centers_[:,1], marker='x', s=50, c='r')


KMeans tries to minimize within-group Euclidean distance, and this may or may not be appropriate for your data. Just based on the graph, I would consider a Gaussian Mixture Model for the unsupervised clustering (a quick sketch follows).
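
A minimal sketch of that idea, assuming the PCA-reduced matrix X_2d from the code above (GaussianMixture is the current scikit-learn class; older versions called it sklearn.mixture.GMM):

from sklearn.mixture import GaussianMixture

# fit a 5-component Gaussian mixture on the reduced data; fit_predict gives the
# maximum-posterior assignments, and predict_proba gives the soft memberships
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=0)
gmm_labels = gmm.fit_predict(X_2d)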

Also, if you have prior knowledge about which observations belong to which category/label, you can do semi-supervised learning (sketched below).
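
A hedged sketch of that, using scikit-learn's LabelSpreading; the partial labels here are purely hypothetical placeholders for whatever you actually know:

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# -1 marks the observations whose label is unknown
known_labels = np.full(len(X_2d), -1, dtype=int)
known_labels[:100] = 0     # hypothetical: first 100 rows known to be class 0
known_labels[100:200] = 1  # hypothetical: next 100 rows known to be class 1

model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X_2d, known_labels)
propagated = model.transduction_   # labels propagated to the unlabeled points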

answered by Jianxun Li