I have an RDD (we can call it myrdd) where each record in the RDD is of the form:
[('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]
I would like to convert this into a DataFrame in pyspark - what is the easiest way to do this?
Convert Using the createDataFrame Method

The SparkSession object has a utility method for creating a DataFrame: createDataFrame. This method can take an RDD and create a DataFrame from it. createDataFrame is overloaded, so we can call it by passing the RDD alone or together with a schema.
In the tutorial example this answer draws on, the pyspark.sql package is imported into the environment, a SparkSession is created with the app name 'Spark RDD to Dataframe PySpark', the sample data is defined in a "SampleDepartment" variable, and a "ResiDD" variable holds the resilient distributed dataset built from it.
Converting a Spark RDD to a DataFrame can be done with toDF(), with createDataFrame(), or by transforming the RDD into an RDD[Row] and passing that to createDataFrame().
How about using the toDF method? You only need to add the field names.
df = rdd.toDF(['column', 'value'])