get cluster labels in mllib kmeans pyspark

Question

How do I get cluster labels when I use Spark's mllib in pyspark? In sklearn, this can be done easily with:

from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=k, random_state=1)
temp = kmeans.fit(data)
cluster_labels = temp.labels_

In mllib, I run k-means as:

from pyspark.mllib.clustering import KMeans
temp = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")

This returns a KMeansModel object. This class doesn't have any equivalent of sklearn's labels_.

I am unable to figure out how to get the labels in mllib's k-means.

David Makovoz · Accepted Answer

This is an old question, and things have changed since it was asked. In pyspark 2.2, the DataFrame-based KMeans (pyspark.ml.clustering) has no train method and its model has no predict method. The way to get the labels is to fit the estimator and then transform the data:

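from pyspark.ml.clustering import KMeans
# data: a DataFrame with a vector column named 'features' (the default featuresCol)
kmeans = KMeans().setK(k).setSeed(1)
model = kmeans.fit(data)
# transform() appends a 'prediction' column holding each row's cluster index
prediction = model.transform(data).select('prediction').collect()
labels = [p.prediction for p in prediction]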
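
For intuition, here is a plain-Python sketch of what a k-means label is: the index of the cluster center nearest to a point. It is not Spark's implementation; the helper names (nearest_center, assign_labels) and the toy points and centers are made up for illustration, and squared Euclidean distance is assumed.

```python
def nearest_center(point, centers):
    """Return the index of the center closest to `point`.

    Uses squared Euclidean distance; on a tie the lowest index wins.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def assign_labels(points, centers):
    """Label every point with the index of its nearest center."""
    return [nearest_center(p, centers) for p in points]


# Hypothetical toy data: two well-separated groups and their centers.
points = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (8.8, 9.1)]
centers = [(0.0, 0.0), (9.0, 9.0)]

labels = assign_labels(points, centers)
print(labels)
```

With the RDD-based API used in the question (pyspark.mllib), the KMeansModel returned by KMeans.train exposes the fitted centers as clusterCenters, and its predict method accepts either a single point or an RDD of points, so temp.predict(data) should give an RDD of labels in the same order as data.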

Tags:

python

apache-spark

scikit-learn

pyspark

apache-spark-mllib

krackoder

1 Answers

David Makovoz
