Convert RDD of Vector in LabeledPoint using Scala - MLLib in Apache Spark

I'm using Apache Spark's MLlib with Scala. I need to convert a group of Vector

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint

into LabeledPoint in order to apply the MLlib algorithms.
Each vector is composed of Double values, either 0.0 (false) or 1.0 (true). All the vectors are stored in an RDD, so the final RDD is of the type

    val data_tmp: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]      

The vectors in the RDD are created with

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    def createArray(values: List[String]): Vector = {
        val arr = new Array[Double](tags_table.size)
        tags_table.foreach(x => arr(x._2) = if (values.contains(x._1)) 1.0 else 0.0)
        Vectors.dense(arr)
    }

    /* each element of result is a List[String] */
    val data_tmp = result.map(x => createArray(x._2))
    val data: RowMatrix = new RowMatrix(data_tmp)
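As an aside, the one-hot encoding that createArray performs can be sketched in plain Scala without Spark. This is only an illustration: the tagsTable map below is a hypothetical stand-in for the question's tags_table, which maps each tag name to a column index.

```scala
// Hypothetical stand-in for the question's tags_table:
// maps each tag name to a column index in the feature vector.
val tagsTable: Map[String, Int] = Map("linux" -> 0, "scala" -> 1, "spark" -> 2)

// One-hot encode a record's tags into a dense Double array:
// 1.0 if the tag is present, 0.0 otherwise.
def encode(values: List[String]): Array[Double] = {
  val arr = new Array[Double](tagsTable.size)
  tagsTable.foreach { case (tag, idx) =>
    arr(idx) = if (values.contains(tag)) 1.0 else 0.0
  }
  arr
}

println(encode(List("scala", "spark")).mkString(","))  // 0.0,1.0,1.0
```

In MLlib the resulting array would then be wrapped with Vectors.dense, exactly as createArray does.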

How can I create, from this RDD (data_tmp) or from the RowMatrix (data), a set of LabeledPoint for use with the MLlib algorithms? For example, I need to apply the linear SVM algorithm shown here.

Alessio Conese asked Nov 09 '14

1 Answer

I found the solution:

    def createArray(values: List[String]): Vector = {
        val arr = new Array[Double](tags_table.size)
        tags_table.foreach(x => arr(x._2) = if (values.contains(x._1)) 1.0 else 0.0)
        Vectors.dense(arr)
    }

    val data_tmp = result.map(x => createArray(x._2))
    /* Wrap each vector in a LabeledPoint; here every point gets the label 1.0. */
    val parsedData = data_tmp.map { line => LabeledPoint(1.0, line) }
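With parsedData built, the MLlib algorithms accept it directly. A minimal training sketch, assuming a live SparkContext, the parsedData RDD above, and an arbitrary iteration count (in practice the labels should come from the data rather than the constant 1.0 used here):

```scala
import org.apache.spark.mllib.classification.SVMWithSGD

// Train a linear SVM on the RDD[LabeledPoint] built above.
// numIterations = 100 is an arbitrary choice for this sketch.
val numIterations = 100
val model = SVMWithSGD.train(parsedData, numIterations)

// Predict the label of a single feature vector.
val prediction = model.predict(parsedData.first().features)
```

Note that SVMWithSGD expects binary labels (0.0 and 1.0); an RDD where every label is 1.0 will train, but the resulting model is degenerate, so each record's true class should be passed to LabeledPoint instead.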
Alessio Conese answered Nov 15 '22