How do I convert an RDD with a SparseVector Column to a DataFrame with a column as Vector

Tags: apache-spark, apache-spark-sql, pyspark, apache-spark-ml, apache-spark-mllib

I have an RDD of (String, SparseVector) tuples and I want to create a DataFrame from it. The goal is a (label: string, features: vector) DataFrame, which is the schema most of the ml algorithms require. I know it can be done because the ml HashingTF transformer outputs a vector column when applied to a DataFrame column.

from pyspark.ml.feature import HashingTF
from pyspark.sql.types import ArrayType, DoubleType, StringType, StructField, StructType

# assuming temp_rdd is an RDD of (double, array of strings)
temp_df = sqlContext.createDataFrame(temp_rdd, StructType([
        StructField("label", DoubleType(), False),
        StructField("tokens", ArrayType(StringType()), False)
    ]))

hashingTF = HashingTF(numFeatures=COMBINATIONS, inputCol="tokens", outputCol="features")

ndf = hashingTF.transform(temp_df)
ndf.printSchema()

# outputs
#root
#|-- label: double (nullable = false)
#|-- tokens: array (nullable = false)
#|    |-- element: string (containsNull = true)
#|-- features: vector (nullable = true)

So my question is: given an RDD of (String, SparseVector), can I convert it to a DataFrame of (String, vector)? I tried the usual sqlContext.createDataFrame, but I could not find a DataType that fits my needs:

df = sqlContext.createDataFrame(rdd, StructType([
        StructField("label", StringType(), True),
        StructField("features", ?Type(), True)
    ]))

asked Sep 23 '15 16:09

Orangel Marquez

2 Answers

You have to use VectorUDT here:

from pyspark.sql.types import DoubleType, StructField, StructType
# In Spark 1.x:
# from pyspark.mllib.linalg import SparseVector, VectorUDT
from pyspark.ml.linalg import SparseVector, VectorUDT

temp_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])

temp_rdd.toDF(schema).printSchema()

## root
##  |-- label: double (nullable = true)
##  |-- features: vector (nullable = true)

For completeness, the Scala equivalent:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DoubleType, StructType}
// In Spark 1.x:
// import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType

val schema = new StructType()
  .add("label", DoubleType)
  // In Spark 1.x:
  // .add("features", new VectorUDT())
  .add("features", VectorType)

val temp_rdd: RDD[Row]  = sc.parallelize(Seq(
  Row(0.0, Vectors.sparse(4, Seq((1, 1.0), (3, 5.5)))),
  Row(1.0, Vectors.sparse(4, Seq((0, -1.0), (2, 0.5))))
))

spark.createDataFrame(temp_rdd, schema).printSchema

// root
// |-- label: double (nullable = true)
// |-- features: vector (nullable = true)

answered Oct 21 '22 08:10

zero323

While @zero323's answer (https://stackoverflow.com/a/32745924/1333621) makes sense, and I wish it had worked for me, the RDD underlying the DataFrame created by sqlContext.createDataFrame(temp_rdd, schema) still contained SparseVector types. I had to do the following to convert them to DenseVector types. If someone has a shorter or better way, I would like to know.

from pyspark.sql import Row
from pyspark.sql.types import DoubleType, StructField, StructType
# In Spark 1.x:
# from pyspark.mllib.linalg import DenseVector, SparseVector, VectorUDT
from pyspark.ml.linalg import DenseVector, SparseVector, VectorUDT

temp_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])

df_w_ftr = temp_rdd.toDF(schema)
df_w_ftr.printSchema()

print('original conversion method: ', df_w_ftr.take(5))
print('\n')
# Rebuild each row with a DenseVector made from the sparse vector's values
temp_rdd_dense = temp_rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
print(type(temp_rdd_dense), type(temp_rdd))
print('using map and toArray:', temp_rdd_dense.take(5))

temp_rdd_dense.toDF().show()

Output:

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

original conversion method:  [Row(label=0.0, features=SparseVector(4, {1: 1.0, 3: 5.5})), Row(label=1.0, features=SparseVector(4, {0: -1.0, 2: 0.5}))]


<class 'pyspark.rdd.PipelinedRDD'> <class 'pyspark.rdd.RDD'>
using map and toArray: [Row(features=DenseVector([0.0, 1.0, 0.0, 5.5]), label=0.0), Row(features=DenseVector([-1.0, 0.0, 0.5, 0.0]), label=1.0)]

+------------------+-----+
|          features|label|
+------------------+-----+
| [0.0,1.0,0.0,5.5]|  0.0|
|[-1.0,0.0,0.5,0.0]|  1.0|
+------------------+-----+
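
The sparse and dense forms hold the same values: a sparse vector stores only its size and its non-zero entries, and converting to an array fills every other position with zero. Below is a plain-Python sketch of that expansion, with no Spark dependency. `sparse_to_dense` is a hypothetical helper written for illustration, not a PySpark function.

```python
def sparse_to_dense(size, entries):
    """Expand a {index: value} mapping into a full-length list of floats."""
    dense = [0.0] * size  # every position not listed in entries stays zero
    for index, value in entries.items():
        if not 0 <= index < size:
            raise IndexError("index %d out of range for size %d" % (index, size))
        dense[index] = float(value)
    return dense


# The two vectors used in the example above.
first = sparse_to_dense(4, {1: 1.0, 3: 5.5})
second = sparse_to_dense(4, {0: -1.0, 2: 0.5})
print(first)
print(second)
```

The resulting lists line up with the DenseVector values in the output shown above.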

answered Oct 21 '22 08:10

meyerson
