
How to initialize cluster centers for K-means in Spark MLlib?

Is there a way to initialize cluster centers while running K-Means in Spark MLlib?

I tried the following:

model = KMeans.train(
    sc.parallelize(data), 3, maxIterations=0,
    initialModel = KMeansModel([(-1000.0,-1000.0),(5.0,5.0),(1000.0,1000.0)]))

Neither initialModel nor setInitialModel is available in spark-mllib_2.10.

asked Feb 16 '16 by Harshit


People also ask

How do you initialize a cluster for K-means?

Method for initialization: 'k-means++' selects initial cluster centers for k-means clustering in a smart way to speed up convergence (see the Notes section of k_init for more details); 'random' chooses n_clusters observations (rows) at random from the data for the initial centroids.
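
For comparison, here is a minimal scikit-learn sketch (not Spark, with purely illustrative data) showing the two initialization modes described above:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# "k-means++" picks well-spread starting centers to speed up convergence
smart_init = KMeans(n_clusters=2, init="k-means++", n_init=10).fit(X)

# "random" draws n_clusters rows of X at random as the starting centers
random_init = KMeans(n_clusters=2, init="random", n_init=10).fit(X)

print(smart_init.cluster_centers_)
print(random_init.cluster_centers_)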

What is cluster centers in K-means?

The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like: The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
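
As a quick illustration of that definition (plain NumPy, not Spark), a cluster center is simply the column-wise mean of the points currently assigned to that cluster:

import numpy as np

# points currently assigned to one cluster
cluster_points = np.array([[9.0, 8.0], [8.0, 9.0]])

# the cluster center is the arithmetic mean of those points
center = cluster_points.mean(axis=0)
print(center)  # [8.5 8.5]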

Which of the following clustering algorithms are supported in Spark?

K-means is one of the most commonly used clustering algorithms for grouping data into a predefined number of clusters. Besides K-means, Spark MLlib also provides bisecting K-means, Gaussian mixture models (GMM), power iteration clustering (PIC), latent Dirichlet allocation (LDA), and streaming K-means.
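
Related to initialization: besides passing an explicit initial model (shown in the answer below), spark.mllib's K-means exposes an initializationMode parameter, either "k-means||" (the default, a parallel variant of k-means++) or "random". A small sketch, assuming the same SparkContext sc as in the rest of this page:

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([
    Vectors.dense([0.0, 0.0]), Vectors.dense([1.0, 1.0]),
    Vectors.dense([9.0, 8.0]), Vectors.dense([8.0, 9.0])
])

# "k-means||" chooses well-spread starting centers; "random" samples k points
model = KMeans.train(data, 2, maxIterations=10, initializationMode="k-means||")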


1 Answer

In Scala, the initial model can be set since Spark 1.5+ using setInitialModel, which takes a KMeansModel:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// example data: two well-separated groups of points
val data = sc.parallelize(Seq(
    "[0.0, 0.0]", "[1.0, 1.0]", "[9.0, 8.0]", "[8.0, 9.0]"
)).map(Vectors.parse(_))

// explicit starting centers, one per expected cluster
val initialModel = new KMeansModel(
   Array("[0.6, 0.6]", "[8.0, 8.0]").map(Vectors.parse(_))
)

val model = new KMeans()
  .setInitialModel(initialModel)
  .setK(2)
  .run(data)

In PySpark 1.6+, pass the initialModel parameter to the train method:

from pyspark.mllib.clustering import KMeansModel, KMeans
from pyspark.mllib.linalg import Vectors

# same example data as above
data = sc.parallelize([
    "[0.0, 0.0]", "[1.0, 1.0]", "[9.0, 8.0]", "[8.0, 9.0]"
]).map(Vectors.parse)

# explicit starting centers, one per expected cluster
initialModel = KMeansModel([
    Vectors.parse(v) for v in ["[0.6, 0.6]", "[8.0, 8.0]"]])

model = KMeans.train(data, 2, initialModel=initialModel)
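
To confirm the supplied centers were actually used as the starting point, you can inspect the fitted centers and the resulting assignments (a small follow-up using the model and data defined just above):

# final centers, refined from the supplied initial model
print(model.clusterCenters)

# cluster index assigned to each input point (collected locally; the example data is tiny)
print([model.predict(p) for p in data.collect()])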

If either of these methods doesn't work, it means you're using an earlier version of Spark.

answered by zero323