For example, suppose I have the DataFrame:
val myDF = sc.parallelize(Seq(("one", 1), ("two", 2), ("three", 3))).toDF("a", "b")
I can convert it to an RDD[(String, Int)] with a map:
val myRDD = myDF.map(r => (r(0).asInstanceOf[String], r(1).asInstanceOf[Int]))
Is there a better way to do this, maybe using the DF schema?
In PySpark, dataFrameObject.rdd converts a DataFrame to an RDD. Several transformations that exist on RDDs are not available on DataFrames, so you often need to convert a PySpark DataFrame to an RDD.
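The Scala API exposes the same accessor. A minimal sketch, assuming the myDF defined above and a spark-shell session (so sc and the implicits are already in scope):
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
// .rdd returns the DataFrame's underlying data as an untyped RDD[Row]
val rowRDD: RDD[Row] = myDF.rdd
rowRDD.map(r => r.getString(0)).collect()  // fields are read positionally, e.g. Array(one, two, three)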
If you have semi-structured data, you can also go the other way and create a DataFrame from an existing RDD by programmatically specifying the schema. The SparkSession object has a utility method for this: createDataFrame. It can take an RDD and build a DataFrame from it; createDataFrame is overloaded, so you can pass the RDD alone or together with a schema.
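For the reverse direction, a minimal sketch of the schema-based overload, assuming a Spark 2.x SparkSession named spark:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val rowRDD = spark.sparkContext.parallelize(Seq(Row("one", 1), Row("two", 2), Row("three", 3)))
val schema = StructType(Seq(
  StructField("a", StringType, nullable = false),
  StructField("b", IntegerType, nullable = false)))
// createDataFrame(rdd, schema) pairs the untyped rows with an explicit schema
val df = spark.createDataFrame(rowRDD, schema)  // columns a: string, b: int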
Using pattern matching over Row:
import org.apache.spark.sql.Row
myDF.map { case Row(a: String, b: Int) => (a, b) }  // yields RDD[(String, Int)] in Spark 1.x
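Note that the match is unchecked at compile time: any Row whose runtime values don't fit the pattern (a null, or a Long where an Int is expected) throws a scala.MatchError.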
In Spark 1.6+ you can use Dataset as follows:
// requires the session implicits in scope (sqlContext.implicits._ on 1.6, spark.implicits._ on 2.x)
myDF.as[(String, Int)].rdd