I am relatively new to Spark and Scala.
I am starting with the following DataFrame (a single column containing a dense Vector of Doubles):
scala> val scaledDataOnly_pruned = scaledDataOnly.select("features")
scaledDataOnly_pruned: org.apache.spark.sql.DataFrame = [features: vector]
scala> scaledDataOnly_pruned.show(5)
+--------------------+
| features|
+--------------------+
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
+--------------------+
A straight conversion to an RDD yields an instance of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]:
scala> val scaledDataOnly_rdd = scaledDataOnly_pruned.rdd
scaledDataOnly_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[32] at rdd at <console>:66
Does anyone know how to convert this DF to an instance of org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] instead? My various attempts have been unsuccessful so far.
Thank you in advance for any pointers!
Convert using the createDataFrame method: this method can take an RDD and create a DataFrame from it. createDataFrame is an overloaded method, so we can call it by passing the RDD alone or together with a schema. When the RDD is passed alone (for example an RDD of tuples), the column names follow a default naming template (_1, _2, ...).
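For illustration, a minimal sketch of both overloads (assuming a SparkSession named spark and a throwaway RDD of (Int, Double) tuples; these names are placeholders, not from the original post):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

val rdd = spark.sparkContext.parallelize(Seq((1, 2.0), (2, 3.5)))

// RDD alone: column names fall back to the default template _1, _2, ...
val dfDefault = spark.createDataFrame(rdd)

// RDD[Row] plus an explicit schema: column names come from the schema
val rowRdd = rdd.map { case (id, value) => Row(id, value) }
val schema = StructType(Seq(StructField("id", IntegerType), StructField("value", DoubleType)))
val dfWithSchema = spark.createDataFrame(rowRdd, schema)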
Converting a Spark RDD to a DataFrame can be done with toDF(), with createDataFrame(), or by transforming an RDD[Row] and applying a schema as above.
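A quick sketch of the toDF() route (again assuming a SparkSession named spark; toDF needs the implicits import):

import spark.implicits._

val df = spark.sparkContext
  .parallelize(Seq((1, 2.0), (2, 3.5)))
  .toDF("id", "value")  // explicit column names instead of the default _1, _2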
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values.
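To make the two representations concrete, a short sketch using the MLlib factory methods:

import org.apache.spark.mllib.linalg.Vectors

// dense: every entry stored in a double array
val dv = Vectors.dense(1.0, 0.0, 3.0)

// sparse: size plus parallel arrays of indices and non-zero values
val sv = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))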
An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector).
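A minimal sketch of building one from an RDD of local vectors (the variable names are illustrative only):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val vectors = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))

// attach a long-typed row index to each local vector, then wrap in an IndexedRowMatrix
val indexedRows = vectors.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }
val matrix = new IndexedRowMatrix(indexedRows)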
Just found out:

import org.apache.spark.mllib.linalg.Vector

val scaledDataOnly_rdd = scaledDataOnly_pruned.rdd.map { row => row.getAs[Vector](0) }
import org.apache.spark.mllib.linalg.Vectors

scaledDataOnly
  .rdd
  .map { row =>
    // this assumes "features" is stored as a Seq[Double] (an array column);
    // with a vector column like the one above, use row.getAs[Vector]("features") instead
    Vectors.dense(row.getAs[Seq[Double]]("features").toArray)
  }
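Either way you end up with an RDD[org.apache.spark.mllib.linalg.Vector], which can be passed straight to MLlib APIs that expect one. For example (a hedged sketch, not from the original post):

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.Statistics

// basic column statistics over the feature vectors
val summary = Statistics.colStats(scaledDataOnly_rdd)

// or treat the RDD[Vector] as the rows of a distributed matrix
val mat = new RowMatrix(scaledDataOnly_rdd)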