
Creating a Spark DataFrame from an RDD of lists

I have an rdd (we can call it myrdd) where each record in the rdd is of the form:

[('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]

I would like to convert this into a DataFrame in pyspark - what is the easiest way to do this?

Asked Apr 07 '15 by mgoldwasser

People also ask

How do you convert a Spark RDD into a DataFrame?

Convert Using createDataFrame Method The SparkSession object has a utility method for creating a DataFrame – createDataFrame. This method can take an RDD and create a DataFrame from it. The createDataFrame is an overloaded method, and we can call the method by passing the RDD alone or with a schema.
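As a minimal sketch of the two call forms described above: the column names, types, and sample data here are illustrative assumptions, not taken from the question, and the Spark calls are shown as comments because they require a running SparkSession (assumed to be named `spark`).

```python
# Sample rows; the (name, age) schema is an assumption for illustration.
data = [("Alice", 34), ("Bob", 45)]

# With a live SparkSession, createDataFrame can take the RDD alone
# (schema inferred) or the RDD plus an explicit schema:
# from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# schema = StructType([
#     StructField("name", StringType(), False),
#     StructField("age", IntegerType(), False),
# ])
# rdd = spark.sparkContext.parallelize(data)
# df = spark.createDataFrame(rdd, schema)

# The pairing of row values to column names that the schema performs
# can be checked without Spark:
columns = ["name", "age"]
rows_as_dicts = [dict(zip(columns, row)) for row in data]
```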

Can we convert RDD to DataFrame in Pyspark?

The PySpark SQL package is imported into the environment to convert an RDD to a DataFrame in PySpark. A SparkSession is defined with 'Spark RDD to Dataframe PySpark' as the app name. A "SampleDepartment" value is created to hold the input data, and a "ResiDD" value is created from it, which stores the resilient distributed dataset.

How do I convert RDD to DataSet in Pyspark?

Converting a Spark RDD to a DataFrame can be done using toDF(), using createDataFrame(), or by transforming an RDD[Row] into a DataFrame.
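The three routes just listed can be sketched side by side. The names (`spark`, `rdd`, the `name`/`age` columns) are assumptions for illustration; the Spark calls are comments because they need a running SparkSession, while the Row-style mapping of route 3 is mimicked below with a namedtuple, which `pyspark.sql.Row` closely resembles.

```python
from collections import namedtuple  # stand-in for pyspark.sql.Row here

# 1) rdd.toDF(["name", "age"])                  - names supplied at call time
# 2) spark.createDataFrame(rdd, schema)         - explicit schema object
# 3) rdd.map(lambda t: Row(name=t[0], age=t[1])).toDF()  - RDD[Row] route

# Route 3 without Spark: a Row pairs field names with values.
Row = namedtuple("Row", ["name", "age"])
row = Row(name="Alice", age=34)
```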


1 Answer

How about using the toDF method? You only need to add the field names.

df = rdd.toDF(['column', 'value'])
Answered Oct 07 '22 by dapangmao
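The one-liner above names two columns, so it fits records that are single (column, value) pairs. For records shaped like the question's, a whole list of ('column N', value) pairs per record, one common route (a sketch, assuming a SparkSession named `spark` and the question's `myrdd`) is to turn each record into a dict first, since a dict maps column names to values directly.

```python
# One record in the question's format: a list of (column_name, value) pairs.
record = [('column 1', 10), ('column 2', 20), ('column 3', 30)]

# A dict gives column-name -> value for that record:
as_dict = dict(record)

# With a live SparkSession, the whole RDD could then become a DataFrame,
# with column names inferred from the dict keys:
# df = spark.createDataFrame(myrdd.map(dict))
```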