I have generated my cluster centers from the features of my data, say kmeans_data.txt, as found in
https://github.com/apache/spark/blob/master/data/mllib/kmeans_data.txt
This was performed using KMeans in Spark MLlib.
clusters.clusterCenters.foreach(println)
Any idea how to predict which cluster each data point belongs to?
Excerpt from the KMeans clustering snippet in the Spark MLlib Scala documentation:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
// here is what I added to predict data points that are within the clusters
clusters.predict(parsedData).foreach(println)
It's pretty simple: if you read the KMeansModel documentation, you will notice that it has two constructors, one of them being:
new KMeansModel(clusterCenters: Array[Vector])
Therefore, you can instantiate a model directly from precomputed KMeans centroids. I show an example below.
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors
val rdd = sc.parallelize(List(
  Vectors.dense(Array(-0.1, 0.0, 0.0)),
  Vectors.dense(Array(9.0, 9.0, 9.0)),
  Vectors.dense(Array(3.0, 2.0, 1.0))))
val centroids = Array(
  Vectors.dense(Array(0.0, 0.0, 0.0)),
  Vectors.dense(Array(0.1, 0.1, 0.1)),
  Vectors.dense(Array(0.2, 0.2, 0.2)),
  Vectors.dense(Array(9.0, 9.0, 9.0)),
  Vectors.dense(Array(9.1, 9.1, 9.1)),
  Vectors.dense(Array(9.2, 9.2, 9.2)))
val model = new KMeansModel(clusterCenters = centroids)
model.predict(rdd).take(10)
// res13: Array[Int] = Array(0, 3, 2)
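As a side note, `predict` also has a single-point overload, `predict(point: Vector): Int`, which is handy for scoring one vector at a time, and you can pair each input vector with its predicted cluster index using `RDD.zip`. A minimal sketch, assuming the `model` and `rdd` defined above and a live `SparkContext`:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Predict the cluster index for a single new point.
val singlePrediction: Int = model.predict(Vectors.dense(Array(3.0, 2.0, 1.0)))

// Pair every input vector with its predicted cluster index.
// predict(rdd) preserves partitioning and ordering, so zip lines up
// each point with its own prediction.
val pointsWithClusters = rdd.zip(model.predict(rdd))
pointsWithClusters.collect().foreach { case (point, cluster) =>
  println(s"$point -> cluster $cluster")
}
```

This avoids having to call `predict` twice or collect the points separately when you want the assignments alongside the original data.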