Apache Spark: How do I convert a Spark DataFrame to an RDD of type RDD[(Type1, Type2, ...)]?

For example, suppose I have the DataFrame:

val myDF = sc.parallelize(Seq(("one",1),("two",2),("three",3))).toDF("a", "b")

I can convert it to an RDD[(String, Int)] with a map:

val myRDD = myDF.map(r => (r(0).asInstanceOf[String], r(1).asInstanceOf[Int]))

Is there a better way to do this, maybe using the DF schema?

asked Jan 22 '16 by evan.oman

People also ask

Can we convert a DataFrame to an RDD in Spark?

PySpark's dataFrameObject.rdd is used to convert a PySpark DataFrame to an RDD; several transformations that are available on RDDs are not available on DataFrames, so you are often required to convert a PySpark DataFrame to an RDD.
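The same accessor exists in the Scala API: df.rdd returns an RDD[Row] (generic rows, not typed tuples). A minimal sketch, assuming an existing SparkSession named spark:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// `spark` is assumed to be an already-constructed SparkSession
val df = spark.createDataFrame(Seq(("one", 1), ("two", 2))).toDF("a", "b")
val rowRDD: RDD[Row] = df.rdd  // yields generic Rows, not typed tuples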

Can we create a DataFrame from an RDD?

If you have semi-structured data, you can create a DataFrame from an existing RDD by programmatically specifying the schema.
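As a sketch of that approach, assuming an existing SparkSession spark and SparkContext sc:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Build Rows, then describe their layout programmatically
val rowRDD = sc.parallelize(Seq(Row("one", 1), Row("two", 2)))
val schema = StructType(Seq(
  StructField("a", StringType, nullable = false),
  StructField("b", IntegerType, nullable = false)
))
val df = spark.createDataFrame(rowRDD, schema)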

Which method can be used to convert a Spark RDD to a DataFrame?

The SparkSession object has a utility method for creating a DataFrame: createDataFrame. This method can take an RDD and create a DataFrame from it. createDataFrame is an overloaded method; we can call it by passing the RDD alone or together with a schema.
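For example, a minimal sketch of the overload that takes the RDD alone (the schema is inferred from the tuple type; spark and sc are assumed to exist):

// Columns default to _1/_2 unless renamed with toDF
val inferred = spark.createDataFrame(sc.parallelize(Seq(("one", 1), ("two", 2)))).toDF("a", "b")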


1 Answer

Using pattern matching over Row:

import org.apache.spark.sql.Row

myDF.map { case Row(a: String, b: Int) => (a, b) }
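Note that DataFrame.map returns an RDD directly only in Spark 1.x; on Spark 2.x and later, where DataFrame is an alias for Dataset[Row], you would go through .rdd first to land on the same result. A sketch:

import org.apache.spark.sql.Row

// Spark 2.x+: take .rdd first to get an RDD[(String, Int)]
val myRDD = myDF.rdd.map { case Row(a: String, b: Int) => (a, b) }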

In Spark 1.6+ you can use the Dataset API as follows:

myDF.as[(String, Int)].rdd
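The tuple Encoder needed by .as[(String, Int)] comes from the session implicits. A minimal sketch, assuming a Spark 2.x SparkSession named spark (on 1.6 the import is sqlContext.implicits._ instead):

import spark.implicits._  // brings Encoder[(String, Int)] into scope

val typedRDD = myDF.as[(String, Int)].rdd  // RDD[(String, Int)]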
answered Sep 22 '22 by zero323