
How to initialize cluster centers for K-means in Spark MLlib?

Is there a way to initialize cluster centers while running K-Means in Spark MLlib?

I tried the following:

model = KMeans.train(
    sc.parallelize(data), 3, maxIterations=0,
    initialModel = KMeansModel([(-1000.0,-1000.0),(5.0,5.0),(1000.0,1000.0)]))

Neither initialModel nor setInitialModel is available in spark-mllib_2.10.

asked Feb 16 '16 by Harshit


People also ask

How do you initialize a cluster for K-means?

Method for initialization: 'k-means++' selects initial cluster centers for k-means clustering in a smart way to speed up convergence (see the Notes section of k_init for more details); 'random' chooses n_clusters observations (rows) at random from the data for the initial centroids.
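
For comparison, here is a minimal scikit-learn sketch (not Spark, with purely illustrative data) showing the two initialization modes described above:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# "k-means++" picks well-spread starting centers to speed up convergence
smart_init = KMeans(n_clusters=2, init="k-means++", n_init=10).fit(X)

# "random" draws n_clusters rows of X at random as the starting centers
random_init = KMeans(n_clusters=2, init="random", n_init=10).fit(X)

print(smart_init.cluster_centers_)
print(random_init.cluster_centers_)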

What is cluster centers in K-means?

The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like: The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
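
As a quick illustration of that definition (plain NumPy, not Spark), a cluster center is simply the column-wise mean of the points currently assigned to that cluster:

import numpy as np

# points currently assigned to one cluster
cluster_points = np.array([[9.0, 8.0], [8.0, 9.0]])

# the cluster center is the arithmetic mean of those points
center = cluster_points.mean(axis=0)
print(center)  # [8.5 8.5]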

Which of the following clustering algorithms are supported in Spark?

K-means is one of the most commonly used clustering algorithms for grouping data into a predefined number of clusters. Besides K-means, Spark MLlib also provides bisecting K-means, Gaussian mixture models (GMM), power iteration clustering (PIC), latent Dirichlet allocation (LDA), and streaming K-means.
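
Related to initialization: besides passing an explicit initial model (shown in the answer below), spark.mllib's K-means exposes an initializationMode parameter, either "k-means||" (the default, a parallel variant of k-means++) or "random". A small sketch, assuming the same SparkContext sc as in the rest of this page:

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([
    Vectors.dense([0.0, 0.0]), Vectors.dense([1.0, 1.0]),
    Vectors.dense([9.0, 8.0]), Vectors.dense([8.0, 9.0])
])

# "k-means||" chooses well-spread starting centers; "random" samples k points
model = KMeans.train(data, 2, maxIterations=10, initializationMode="k-means||")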


1 Answer

In Scala, the initial model can be set since Spark 1.5+ using setInitialModel, which takes a KMeansModel:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// example data: two well-separated groups of points
val data = sc.parallelize(Seq(
    "[0.0, 0.0]", "[1.0, 1.0]", "[9.0, 8.0]", "[8.0, 9.0]"
)).map(Vectors.parse(_))

// explicit starting centers, one per expected cluster
val initialModel = new KMeansModel(
   Array("[0.6, 0.6]", "[8.0, 8.0]").map(Vectors.parse(_))
)

val model = new KMeans()
  .setInitialModel(initialModel)
  .setK(2)
  .run(data)

In PySpark 1.6+, pass the initialModel parameter to the train method:

from pyspark.mllib.clustering import KMeansModel, KMeans
from pyspark.mllib.linalg import Vectors

# same example data as above
data = sc.parallelize([
    "[0.0, 0.0]", "[1.0, 1.0]", "[9.0, 8.0]", "[8.0, 9.0]"
]).map(Vectors.parse)

# explicit starting centers, one per expected cluster
initialModel = KMeansModel([
    Vectors.parse(v) for v in ["[0.6, 0.6]", "[8.0, 8.0]"]])

model = KMeans.train(data, 2, initialModel=initialModel)
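
To confirm the supplied centers were actually used as the starting point, you can inspect the fitted centers and the resulting assignments (a small follow-up using the model and data defined just above):

# final centers, refined from the supplied initial model
print(model.clusterCenters)

# cluster index assigned to each input point (collected locally; the example data is tiny)
print([model.predict(p) for p in data.collect()])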

If either of these methods doesn't work, it means you're using an earlier version of Spark.

answered by zero323