Problem:
a) Read a local file into a pandas DataFrame, say PD_DF. b) Manipulate/massage PD_DF and add columns to it. c) Write PD_DF to HDFS using Spark. How do I do it?
You can use the SQLContext object to invoke the createDataFrame method, which accepts input data that can optionally be a pandas DataFrame object. (In Spark 2.x, the SparkSession object exposes the same createDataFrame method.)
Let's say dataframe is of type pandas.core.frame.DataFrame. In Spark 2.1 (PySpark) I did this:
rdd_data = spark.createDataFrame(dataframe).rdd
If you want to rename any columns or select only a few of them, do that before the .rdd call.
Hope it works for you too.