How do I get cluster labels when I use Spark's mllib in pyspark? In sklearn, this can be done easily with:
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=k, random_state=1)
temp = kmeans.fit(data)
cluster_labels = temp.labels_  # one cluster index per input row
In mllib, I run k-means as:
from pyspark.mllib.clustering import KMeans
temp = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")
This returns a KMeansModel object. This class doesn't have any equivalent of sklearn's labels_ attribute.
I am unable to figure out how to get the labels in mllib's k-means.
This is an old question, and the API has changed since it was asked. In pyspark 2.2, the DataFrame-based KMeans in pyspark.ml has no train method, and its model has no predict method. The correct way to get the labels there is:
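from pyspark.ml.clustering import KMeans

kmeans = KMeans().setK(k).setSeed(1)
model = kmeans.fit(data)  # data: DataFrame with a 'features' vector column
# transform() appends a 'prediction' column holding each row's cluster index
prediction = model.transform(data).select('prediction').collect()
labels = [p.prediction for p in prediction]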
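To make concrete what the values in the prediction column mean, here is a minimal pure-Python sketch of the assignment step: each point gets the index of its nearest cluster center. The function name nearest_center_labels and the toy data are made up for illustration, and squared Euclidean distance is assumed. In practice the centers would presumably come from something like model.clusterCenters(), and Spark does this step for you inside transform.

```python
def nearest_center_labels(points, centers):
    """Return, for each point, the index of the closest center.

    Hypothetical helper for illustration; assumes squared Euclidean distance.
    """
    labels = []
    for p in points:
        # Squared distance from this point to every center.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        # The label is the position of the smallest distance.
        labels.append(dists.index(min(dists)))
    return labels


# Toy data, invented for this example only.
centers = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, 1.0], [9.0, 8.0], [0.5, -0.5]]
labels = nearest_center_labels(points, centers)
print(labels)  # [0, 1, 0]
```

The resulting list plays the same role as sklearn's labels_: one integer per input row, in input order.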