
Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

I am relatively new to Spark and Scala.

I am starting with the following DataFrame (a single column holding a dense Vector of Doubles):

scala> val scaledDataOnly_pruned = scaledDataOnly.select("features")
scaledDataOnly_pruned: org.apache.spark.sql.DataFrame = [features: vector]

scala> scaledDataOnly_pruned.show(5)
+--------------------+
|            features|
+--------------------+
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
+--------------------+

A straight conversion to an RDD yields an instance of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]:

scala> val scaledDataOnly_rdd = scaledDataOnly_pruned.rdd
scaledDataOnly_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[32] at rdd at <console>:66

Does anyone know how to convert this DF to an instance of org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] instead? My various attempts have been unsuccessful so far.

Thank you in advance for any pointers!

asked Oct 09 '15 by Yeye


People also ask

How do you convert an RDD into a DataFrame or Dataset?

Use the createDataFrame method, which can take an RDD and create a DataFrame from it. createDataFrame is an overloaded method: we can call it by passing the RDD alone or together with a schema. When no schema is supplied, the column names follow a default naming template.
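As a minimal sketch of the createDataFrame approach described above (assuming an existing SparkContext `sc` and SQLContext `sqlContext`, which are hypothetical names here, as in the Spark 1.x shell):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Build an RDD[Row] to convert
val rowRDD = sc.parallelize(Seq(Row(1.0), Row(2.0)))

// Supply an explicit schema; without one, columns get default names
val schema = StructType(Seq(StructField("value", DoubleType, nullable = false)))

val df = sqlContext.createDataFrame(rowRDD, schema)
```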

How do I convert an RDD to a DataFrame in PySpark?

Converting a Spark RDD to a DataFrame can be done with toDF(), with createDataFrame(), or by transforming an RDD[Row] into a DataFrame.
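A short sketch of the toDF() route in Scala (assuming an existing SparkContext `sc` and SQLContext `sqlContext`; the implicits import is what brings toDF into scope):

```scala
import sqlContext.implicits._

// toDF() on an RDD of tuples infers the schema; column names are passed explicitly
val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "label")
```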

What is Spark vector?

A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values.
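The two local vector types described above can be constructed like this (this needs only the spark-mllib dependency, no running cluster):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense vector: all entry values stored in a double array
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector: size plus parallel arrays of indices and values
val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Both represent the same vector; e.g. dv(2) == 3.0 and sv(2) == 3.0
```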

What is basic difference between row matrix and IndexedRow Matrix?

An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector).
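A minimal sketch of building an IndexedRowMatrix from an RDD[IndexedRow], as described above (assuming an existing SparkContext `sc`):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Each IndexedRow wraps a long-typed index and a local vector
val indexedRows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))
))

val mat = new IndexedRowMatrix(indexedRows)
```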


2 Answers

Just found out:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Extract the Vector from column 0 of each Row
val scaledDataOnly_rdd = scaledDataOnly_pruned.rdd.map { row: Row => row.getAs[Vector](0) }
answered Oct 04 '22 by Yeye


import org.apache.spark.mllib.linalg.Vectors

// Note: this assumes the "features" column is stored as a sequence of doubles;
// if it already holds a Vector (as in the question), use getAs[Vector] instead
scaledDataOnly
  .rdd
  .map { row => Vectors.dense(row.getAs[Seq[Double]]("features").toArray) }
answered Oct 04 '22 by Santoshi M