
Can I convert a pandas DataFrame to a Spark RDD?

Tags:

pyspark

Problem:

a) Read a local file into a pandas DataFrame, say PD_DF.
b) Manipulate/massage PD_DF and add columns to it.
c) Write PD_DF to HDFS using Spark.

How do I do it?

Ram Narayanan asked Dec 09 '22 03:12


2 Answers

You can call the createDataFrame method on the SQLContext object; its data argument can be a pandas DataFrame.
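As a rough sketch of the three steps from the question: the column names (price, qty, total), the local file path, and the HDFS URI below are all hypothetical, and the sketch assumes Spark 2.x or later, where a SparkSession exposes the same createDataFrame method as SQLContext. CSV is just one possible output format.

```python
import pandas as pd


def prepare_pandas_frame(pd_df: pd.DataFrame) -> pd.DataFrame:
    """Step (b): add a derived column on the pandas side (hypothetical columns)."""
    out = pd_df.copy()
    out["total"] = out["price"] * out["qty"]
    return out


def write_to_hdfs(spark, pd_df: pd.DataFrame, hdfs_path: str) -> None:
    """Step (c): convert to a Spark DataFrame, then let Spark write it out."""
    spark_df = spark.createDataFrame(pd_df)
    spark_df.write.mode("overwrite").csv(hdfs_path, header=True)


def main() -> None:
    # Imported here so the pandas-only helper works without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pandas-to-hdfs").getOrCreate()
    pd_df = pd.read_csv("/tmp/input.csv")  # step (a); hypothetical local file
    write_to_hdfs(spark, prepare_pandas_frame(pd_df), "hdfs:///tmp/output_csv")
    spark.stop()
```

Calling main() needs a working Spark installation with access to the cluster. Note that createDataFrame ships the whole pandas DataFrame from the driver, so this only suits data that fits in driver memory.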

trianta2 answered Jan 03 '23 04:01


Let's say dataframe is of type pandas.core.frame.DataFrame. In PySpark on Spark 2.1, I did this:

rdd_data = spark.createDataFrame(dataframe)\
                .rdd

If you want to rename any columns or select only a few columns, do that before calling .rdd.
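A minimal sketch of that ordering, assuming the renaming and selection are done on the Spark DataFrame (they could equally be done in pandas beforehand). The column names name, qty and quantity are hypothetical, and spark is assumed to be an existing SparkSession:

```python
import pandas as pd


def to_rdd(spark, dataframe: pd.DataFrame):
    """Convert a pandas DataFrame to an RDD of Rows, reshaping columns first."""
    spark_df = spark.createDataFrame(dataframe)
    spark_df = (
        spark_df
        .withColumnRenamed("qty", "quantity")  # hypothetical rename
        .select("name", "quantity")            # keep only the columns needed
    )
    # .rdd comes last, once the columns look the way you want
    return spark_df.rdd
```

Doing the column work first keeps it in the DataFrame API; after .rdd you would be working with plain Row objects and would have to do the same reshaping by hand with map.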

Hope it works for you also.

sam answered Jan 03 '23 03:01