
get cluster labels in mllib kmeans pyspark

How do I get cluster labels when I use Spark's MLlib in PySpark? In sklearn, this is easy:

from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=k, random_state=1)
temp = kmeans.fit(data)
cluster_labels = temp.labels_

In MLlib, I run KMeans as:

from pyspark.mllib.clustering import KMeans
temp = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")

This returns a KMeansModel object. This class doesn't have any equivalent of sklearn's labels_.

I am unable to figure out how to get the labels in MLlib's KMeans.

krackoder asked Dec 25 '22 07:12



1 Answer

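This is an old question, and the API has changed since it was asked. In PySpark 2.2, the DataFrame-based KMeans in pyspark.ml has no train method, and its model has no predict method. With that API, the way to get the labels is to fit the estimator and read the prediction column that transform adds: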

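from pyspark.ml.clustering import KMeans

# data must be a DataFrame with a vector column named 'features' (the default featuresCol)
kmeans = KMeans().setK(k).setSeed(1)
model = kmeans.fit(data)
# transform() appends a 'prediction' column holding each row's cluster index
prediction = model.transform(data).select('prediction').collect()
labels = [p.prediction for p in prediction]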
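
For intuition, a label is just the index of the closest cluster center, which is what the prediction column holds. Here is a minimal pure-Python sketch of that assignment step, assuming Euclidean distance and made-up points and centers. `assign_labels` is a hypothetical helper for illustration, not part of Spark; in practice the centers would come from the fitted model.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two equal-length points."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def assign_labels(points, centers):
    """Return, for each point, the index of its nearest center."""
    labels = []
    for p in points:
        distances = [squared_distance(p, c) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical data: two obvious groups and one center near each group.
points = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (8.8, 9.1)]
centers = [(0.0, 0.1), (9.0, 9.0)]

labels = assign_labels(points, centers)
print(labels)  # [0, 0, 1, 1]
```

This only illustrates what the labels mean. On real data it is better to let Spark do the assignment, since collecting everything to the driver does not scale.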
David Makovoz answered Dec 29 '22 05:12
