I have a script with the below setup.
I am using:
1) Spark DataFrames to pull data in
2) Converting to pandas DataFrames after initial aggregation
3) Want to convert back to Spark for writing to HDFS
The conversion from Spark to pandas was simple, but I am struggling with how to convert a pandas DataFrame back to Spark.
Can you advise?
from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
import pandas as pd

def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session
### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')
I've tried the below: no errors, but also no data! To confirm, df6 does have data and is a pandas DataFrame.
df6 = df5.sort_values(['sdsf'], ascending=[True])
sdf = spark_session.createDataFrame(df6)
sdf.show()
Spark provides a createDataFrame(pandas_dataframe) method to convert a pandas DataFrame to a Spark DataFrame; by default, Spark infers the schema by mapping the pandas data types to PySpark data types. If you want all columns as strings, use spark.createDataFrame(pandasDF.astype(str)).
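A minimal sketch of both variants, reusing df6 and spark_session from the question:

# Let Spark infer the schema from the pandas dtypes
sdf = spark_session.createDataFrame(df6)
sdf.printSchema()

# Or cast every column to string first if schema inference misbehaves
sdf_str = spark_session.createDataFrame(df6.astype(str))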
The easiest way to convert Pandas DataFrames to PySpark is through Apache Arrow. Apache Arrow is a language-independent, in-memory columnar format that can be used to optimize the conversion between Spark and Pandas DataFrames when using toPandas() or createDataFrame().
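If you want to try Arrow, the toggle below is a sketch; the config key is spark.sql.execution.arrow.pyspark.enabled in Spark 3.x (older releases used spark.sql.execution.arrow.enabled):

# Enable Arrow-accelerated pandas <-> Spark conversion (Spark 3.x key)
spark_session.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
sdf = spark_session.createDataFrame(df6)  # conversion now goes through Arrow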
To create a PySpark DataFrame from an existing RDD, first create an RDD using the .parallelize() method and then convert it into a PySpark DataFrame using the .createDataFrame() method of SparkSession. To start using PySpark, we first need to create a SparkSession.
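A minimal sketch of that RDD route; the sample rows and column names are made up for illustration:

# Build an RDD of tuples, then convert it with an explicit column list
rdd = spark_session.sparkContext.parallelize([(1, 'a'), (2, 'b')])
rdd_df = spark_session.createDataFrame(rdd, ['id', 'label'])
rdd_df.show()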
This method should only be used if the resulting pandas DataFrame is expected to be small, as all the data is loaded into the driver's memory.
toPandas() collects all records in the PySpark DataFrame to the driver program, so it should only be run on a small subset of the data; running it on larger datasets can cause memory errors and crash the application.
pandasDF = pysparkDF.toPandas()
print(pandasDF)
This yields a plain pandas DataFrame on the driver.
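If you only need a sample on the driver, one option is to cap the row count first; the 1000-row limit below is an arbitrary choice:

# Collect only a bounded subset to the driver
small_pd = pysparkDF.limit(1000).toPandas()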
Note: this action sends all records in the Spark DataFrame to the driver application, which may cause performance issues. To improve performance, Apache Arrow can be enabled in Spark for these conversions.
Here we go:
# Spark to Pandas
df_pd = df.toPandas()
# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)
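And since the stated goal is to land the result back on HDFS, here is a sketch of the final write; the path and output format are placeholders, not from the question:

# Write the round-tripped DataFrame back out; path/format are assumptions
df_sp.write.mode('overwrite').parquet('hdfs:///user/example/output')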