How to convert org.apache.spark.rdd.RDD[Array[Double]] to Array[Double] which is required by Spark MLlib

I am trying to implement KMeans using Apache Spark.

val data = sc.textFile(irisDatasetString)
val parsedData = data.map(_.split(',').map(_.toDouble)).cache()

val clusters = KMeans.train(parsedData,3,numIterations = 20)

This gives the following error:

error: overloaded method value train with alternatives:
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int,initializationMode: String)org.apache.spark.mllib.clustering.KMeansModel
 cannot be applied to (org.apache.spark.rdd.RDD[Array[Double]], Int, numIterations: Int)
       val clusters = KMeans.train(parsedData,3,numIterations = 20)

So I tried converting the Array[Double] to a Vector, as shown here:

scala> val vectorData: Vector = Vectors.dense(parsedData)

This gave the following errors:

error: type Vector takes type parameters
   val vectorData: Vector = Vectors.dense(parsedData)
                   ^
error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
  (firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
 cannot be applied to (org.apache.spark.rdd.RDD[Array[Double]])
       val vectorData: Vector = Vectors.dense(parsedData)

So I infer that org.apache.spark.rdd.RDD[Array[Double]] is not the same as Array[Double].

How can I proceed with my data as an org.apache.spark.rdd.RDD[Array[Double]]? Or how can I convert an org.apache.spark.rdd.RDD[Array[Double]] to an Array[Double]?

asked Jan 08 '15 06:01 by sand

1 Answer

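KMeans.train expects an RDD[Vector] (that is, org.apache.spark.mllib.linalg.Vector), not an RDD[Array[Double]]. You do not need to pull the RDD back into a local Array[Double]; convert each element to a Vector inside the map instead. It seems to me that all you need to do is change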

val parsedData = data.map(_.split(',').map(_.toDouble)).cache()

to

val parsedData = data.map(x => Vectors.dense(x.split(',').map(_.toDouble))).cache()
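
For completeness, here is a minimal end-to-end sketch of how the pieces might fit together. It is not taken from the original post: the object name IrisKMeans, the helper parseLine, the file name, and the local master setting are all hypothetical, and it assumes every column in the file is numeric. MLlib's Vector is imported under the alias MLVector because a bare Vector in the REPL resolves to Scala's own collection type, which is probably what caused the "type Vector takes type parameters" error. The overloads listed in the question's error message name the third parameter maxIterations rather than numIterations, so the value is passed positionally here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector => MLVector, Vectors}
import org.apache.spark.rdd.RDD

object IrisKMeans {
  // Parse one comma-separated line of numbers into an MLlib dense vector.
  def parseLine(line: String): MLVector =
    Vectors.dense(line.split(',').map(_.toDouble))

  def main(args: Array[String]): Unit = {
    // Assumption: run locally; adjust the master for a real cluster.
    val conf = new SparkConf().setAppName("IrisKMeans").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical path; every column is assumed to be numeric.
      val irisDatasetString = "iris_numeric.csv"
      val data = sc.textFile(irisDatasetString)

      // The conversion happens per element, so the result is RDD[Vector].
      val parsedData: RDD[MLVector] = data.map(parseLine).cache()

      // Arguments: data, k = 3, maxIterations = 20.
      val clusters = KMeans.train(parsedData, 3, 20)
      clusters.clusterCenters.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If the file still contains a non-numeric label column, toDouble will throw on that field, so the column would have to be dropped before calling Vectors.dense.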

answered Oct 18 '22 14:10 by Mike Park