Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert a pandas dataframe to a PySpark dataframe [duplicate]

I have a script with the below setup.

I am using:

1) Spark dataframes to pull data in 2) Converting to pandas dataframes after initial aggregatioin 3) Want to convert back to Spark for writing to HDFS

The conversion from Spark --> Pandas was simple, but I am struggling with how to convert a Pandas dataframe back to spark.

Can you advise?

from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
from pyspark.sql import *
import pyspark.sql.functions as sqlfunc
import pandas as pd

def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session
### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')

I've tried the below - no errors, just no data! To confirm, df6 does have data & is a pandas dataframe

df6 = df5.sort_values(['sdsf'], ascending=["true"])
sdf = spark_session.createDataFrame(df6)
sdf.show()
like image 729
kikee1222 Avatar asked Oct 23 '18 07:10

kikee1222


People also ask

Can pandas DataFrame convert to DataFrame PySpark?

Spark provides a createDataFrame(pandas_dataframe) method to convert pandas to Spark DataFrame, Spark by default infers the schema based on the pandas data types to PySpark data types. If you want all data types to String use spark. createDataFrame(pandasDF. astype(str)) .

How do I change pandas code to PySpark?

The easiest way to convert Pandas DataFrames to PySpark is through Apache Arrow. Apache Arrow is a language-independent, in-memory columnar format that can be used to optimize the conversion between Spark and Pandas DataFrames when using toPandas() or createDataFrame().

How do I create a new DataFrame from an existing DataFrame in PySpark?

To create a PySpark DataFrame from an existing RDD, we will first create an RDD using the . parallelize() method and then convert it into a PySpark DataFrame using the . createDatFrame() method of SparkSession. To start using PySpark, we first need to create a Spark Session.

How do I convert pandas to spark dataframe?

Spark provides a createDataFrame (pandas_dataframe) method to convert Pandas to Spark DataFrame, Spark by default infers the schema based on the Pandas data types to PySpark data types. If you want all data types to String use spark.createDataFrame (pandasDF.astype (str)).

When to use NumPy instead of pyspark in pandas?

This method should only be used if the resulting pandas DataFrame is expected to be small, as all the data is loaded into the driver’s memory. pyspark.pandas.DataFrame.to_numpy

What is Topandas () in pyspark Dataframe?

toPandas () results in the collection of all records in the PySpark DataFrame to the driver program and should be done on a small subset of the data. running on larger dataset’s results in memory error and crashes the application. pandasDF = pysparkDF. toPandas () print(pandasDF) This yields the below panda’s dataframe.

How to improve the performance of spark dataframe conversion?

Note: this action will cause all records in Spark DataFrame to be sent to driver application which may cause performance issues. To improve performance, Apache Arrow can be enabled in Spark for the conversions. Refer to this article for more details:


Video Answer


1 Answers

Here we go:

# Spark to Pandas
df_pd = df.toPandas()

# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)
like image 75
Andrea Avatar answered Sep 21 '22 00:09

Andrea