I am performing clustering on a dataset using PySpark. To find the number of clusters, I ran K-means over a range of values (2, 20) and computed the WSSE (within-cluster sum of squares) for each value of k. This is where I found something unusual. My understanding is that as you increase the number of clusters, the WSSE decreases monotonically. But the results I got say otherwise. I'm displaying the WSSE for the first few values of k only.
Results from Spark
For k = 002 WSSE is 255318.793358
For k = 003 WSSE is 209788.479560
For k = 004 WSSE is 208498.351074
For k = 005 WSSE is 142573.272672
For k = 006 WSSE is 154419.027612
For k = 007 WSSE is 115092.404604
For k = 008 WSSE is 104753.205635
For k = 009 WSSE is 98000.985547
For k = 010 WSSE is 95134.137071
If you look at the WSSE values for k=5 and k=6, you'll see that the WSSE has increased. I turned to sklearn to see if I would get similar results. The code I used for Spark and sklearn is in the appendix section towards the end of the post. I have tried to use the same values for the parameters in the Spark and sklearn KMeans models. The following are the results from sklearn, and they are as I expected them to be: monotonically decreasing.
Results from sklearn
For k = 002 WSSE is 245090.224247
For k = 003 WSSE is 201329.888159
For k = 004 WSSE is 166889.044195
For k = 005 WSSE is 142576.895154
For k = 006 WSSE is 123882.070776
For k = 007 WSSE is 112496.692455
For k = 008 WSSE is 102806.001664
For k = 009 WSSE is 95279.837212
For k = 010 WSSE is 89303.574467
I am not sure why the WSSE values increase in Spark. I tried different datasets and found similar behavior there as well. Is there someplace I am going wrong? Any clues would be great.
The dataset is located here.
Read the data and declare variables
# get data
import pandas as pd
url = "https://raw.githubusercontent.com/vectosaurus/bb_lite/master/3.0%20data/adult_comp_cont.csv"
df_pandas = pd.read_csv(url)
df_spark = sqlContext.createDataFrame(df_pandas)
target_col = 'high_income'
numeric_cols = [i for i in df_pandas.columns if i !=target_col]
k_min = 2 # 2 is inclusive
k_max = 21 # 21 is exclusive; will fit up to k = 20
max_iter = 1000
seed = 42
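Note that sqlContext is assumed to come from the notebook/shell environment. On Spark 2.x and later, a rough equivalent (just a sketch, using the SparkSession entry point) would be:
# Assuming Spark 2.x+: SparkSession replaces the old sqlContext entry point
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('kmeans_wsse').getOrCreate()
df_spark = spark.createDataFrame(df_pandas)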
This is the code I am using for getting the sklearn results:
from sklearn.cluster import KMeans as KMeans_SKL
from sklearn.preprocessing import StandardScaler as StandardScaler_SKL
ss = StandardScaler_SKL(with_std=True, with_mean=True)
ss.fit(df_pandas.loc[:, numeric_cols])
df_pandas_scaled = pd.DataFrame(ss.transform(df_pandas.loc[:, numeric_cols]))
wsse_collect = []
for i in range(k_min, k_max):
    km = KMeans_SKL(random_state=seed, max_iter=max_iter, n_clusters=i)
    _ = km.fit(df_pandas_scaled)
    wsse = km.inertia_
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse))
    wsse_collect.append(wsse)
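For reference, the collected wsse_collect values can be plotted as an elbow curve to pick k; a minimal sketch (matplotlib is assumed to be available and is not part of the code above):
import matplotlib.pyplot as plt

# Elbow curve: WSSE (inertia) against the number of clusters
plt.plot(range(k_min, k_max), wsse_collect, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('WSSE (inertia)')
plt.title('Elbow curve - sklearn KMeans')
plt.show()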
This is the code I am using for getting the Spark results:
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.clustering import KMeans
standard_scaler_inpt_features = 'ss_features'
kmeans_input_features = 'features'
kmeans_prediction_col = 'prediction'
assembler = VectorAssembler(inputCols=numeric_cols, outputCol=standard_scaler_inpt_features)
assembled_df = assembler.transform(df_spark)
scaler = StandardScaler(inputCol=standard_scaler_inpt_features, outputCol=kmeans_input_features, withStd=True, withMean=True)
scaler_model = scaler.fit(assembled_df)
scaled_data = scaler_model.transform(assembled_df)
wsse_collect_spark = []
for i in range(k_min, k_max):
    km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_col,
                k=i, maxIter=max_iter, seed=seed)
    km_fit = km.fit(scaled_data)
    wsse_spark = km_fit.computeCost(scaled_data)
    wsse_collect_spark.append(wsse_spark)
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse_spark))
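A side note: on Spark 2.4+ computeCost is deprecated, and (if I read the API correctly) the training-set WSSE is also exposed on the fitted model's summary. A sketch of the equivalent call:
# Assuming Spark 2.4+: the training WSSE is available on the model summary,
# avoiding the extra computeCost() pass over the data
wsse_spark = km_fit.summary.trainingCost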
UPDATE
Following @Michail N's answer, I changed the tol and maxIter values for the Spark KMeans model. I re-ran the code but saw the same behavior repeating. But since Michail mentioned that Spark MLlib, in fact, implements K-means||, I increased the number of initSteps by a factor of 50 and re-ran the process, which gave the following results.
For k = 002 WSSE is 255318.718684
For k = 003 WSSE is 212364.906298
For k = 004 WSSE is 185999.709027
For k = 005 WSSE is 168616.028321
For k = 006 WSSE is 123879.449228
For k = 007 WSSE is 113646.930680
For k = 008 WSSE is 102803.889178
For k = 009 WSSE is 97819.497501
For k = 010 WSSE is 99973.198132
For k = 011 WSSE is 89103.510831
For k = 012 WSSE is 84462.110744
For k = 013 WSSE is 78803.619605
For k = 014 WSSE is 82174.640611
For k = 015 WSSE is 79157.287447
For k = 016 WSSE is 75007.269644
For k = 017 WSSE is 71610.292172
For k = 018 WSSE is 68706.739299
For k = 019 WSSE is 65440.906151
For k = 020 WSSE is 66396.106118
The increase in WSSE from k=5 to k=6 disappears. The behavior still persists if you look at k=13 and k=14 and elsewhere, but at least I got to know where this was coming from.
There is nothing wrong with the WSSE not decreasing monotonically. In theory the WSSE must decrease monotonically only if the clustering is optimal, meaning that, out of all possible clusterings with k centers, it is the one with the best WSSE.
The problem is that K-means is not necessarily able to find the optimal clustering for a given k. Its iterative process can converge from a random starting point to a local minimum, which may be good but is not optimal.
There are methods like K-means++ and K-means|| whose selection algorithms are more likely to choose diverse, well-separated centroids and lead more reliably to a good clustering, and Spark MLlib, in fact, implements K-means||. However, all of them still have an element of randomness in the selection and cannot guarantee an optimal clustering.
The random starting set of clusters chosen for k=6 perhaps led to a particularly suboptimal clustering, or it may have stopped early before it reached its local optimum.
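Since only the initialization is random, one workaround (just a sketch, reusing the variable names from the question's code; similar in spirit to what sklearn's n_init parameter does for you) is to fit the model with several seeds for each k and keep the lowest WSSE:
# Sketch: try several seeds per k and keep the best (lowest) WSSE
def best_wsse_for_k(scaled_data, k, seeds=(0, 1, 2, 3, 4), max_iter=1000):
    best = float('inf')
    for s in seeds:
        km = KMeans(featuresCol=kmeans_input_features,
                    predictionCol=kmeans_prediction_col,
                    k=k, maxIter=max_iter, seed=s)
        model = km.fit(scaled_data)
        best = min(best, model.computeCost(scaled_data))
    return best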
You can also improve things by changing the parameters of KMeans manually. The algorithm exposes a threshold via tol that controls the minimum amount of cluster centroid movement considered significant; lower values mean the K-means algorithm will let the centroids continue to move longer.
Increasing the maximum number of iterations with maxIter also prevents it from potentially stopping too early at the cost of possibly more computation.
So my advice is to re-run your clustering with
...
# increase from default 20
max_iter = 40
# decrease from default 0.0001
tol = 0.00001
km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_col,
            k=i, maxIter=max_iter, seed=seed, tol=tol)
...
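Since Spark initializes with k-means||, another knob worth raising is initSteps (default 2), which gives the initializer more passes to pick well-separated centers. A sketch of setting it (not part of the original advice above):
# Sketch: more initialization steps for the k-means|| initializer (default is 2)
km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_col,
            k=i, maxIter=max_iter, seed=seed, tol=tol, initSteps=10)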