I have a script with the below setup.
I am using:
1) Spark DataFrames to pull data in
2) Converting to pandas DataFrames after initial aggregation
3) Want to convert back to Spark for writing to HDFS
The conversion from Spark to pandas was simple, but I am struggling with how to convert a pandas DataFrame back to Spark.
Can you advise?
from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
import pandas as pd

def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session
### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')
I've tried the below: no errors, but also no data! To confirm, df6 does have data and is a pandas DataFrame.
df6 = df5.sort_values(['sdsf'], ascending=[True])
sdf = spark_session.createDataFrame(df6)
sdf.show()
Spark provides a createDataFrame(pandas_dataframe) method to convert a pandas DataFrame to a Spark DataFrame; by default, Spark infers the schema by mapping the pandas data types to PySpark data types. If you want all columns as strings, use spark.createDataFrame(pandasDF.astype(str)).
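A minimal sketch of both variants, reusing df6 and spark_session from the question:

# Let Spark infer the schema from the pandas dtypes
sdf = spark_session.createDataFrame(df6)
sdf.printSchema()

# Or cast every column to string first if schema inference misbehaves
sdf_str = spark_session.createDataFrame(df6.astype(str))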
The easiest way to convert Pandas DataFrames to PySpark is through Apache Arrow. Apache Arrow is a language-independent, in-memory columnar format that can be used to optimize the conversion between Spark and Pandas DataFrames when using toPandas() or createDataFrame().
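If you want to try Arrow, the toggle below is a sketch; the config key is spark.sql.execution.arrow.pyspark.enabled in Spark 3.x (older releases used spark.sql.execution.arrow.enabled):

# Enable Arrow-accelerated pandas <-> Spark conversion (Spark 3.x key)
spark_session.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
sdf = spark_session.createDataFrame(df6)  # conversion now goes through Arrow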
To create a PySpark DataFrame from an existing RDD, first create an RDD using the .parallelize() method and then convert it into a PySpark DataFrame using the .createDataFrame() method of SparkSession. To start using PySpark, we first need to create a SparkSession.
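A minimal sketch of that RDD route; the sample rows and column names are made up for illustration:

# Build an RDD of tuples, then convert it with an explicit column list
rdd = spark_session.sparkContext.parallelize([(1, 'a'), (2, 'b')])
rdd_df = spark_session.createDataFrame(rdd, ['id', 'label'])
rdd_df.show()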
This method should only be used if the resulting pandas DataFrame is expected to be small, as all the data is loaded into the driver's memory.
toPandas() collects all records in the PySpark DataFrame to the driver program, so it should only be run on a small subset of the data; running it on larger datasets can cause memory errors and crash the application.
pandasDF = pysparkDF.toPandas()
print(pandasDF)
This yields a plain pandas DataFrame on the driver.
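If you only need a sample on the driver, one option is to cap the row count first; the 1000-row limit below is an arbitrary choice:

# Collect only a bounded subset to the driver
small_pd = pysparkDF.limit(1000).toPandas()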
Note: this action sends all records in the Spark DataFrame to the driver application, which may cause performance issues. To improve performance, Apache Arrow can be enabled in Spark for these conversions.
Here we go:
# Spark to Pandas
df_pd = df.toPandas()
# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)
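And since the stated goal is to land the result back on HDFS, here is a sketch of the final write; the path and output format are placeholders, not from the question:

# Write the round-tripped DataFrame back out; path/format are assumptions
df_sp.write.mode('overwrite').parquet('hdfs:///user/example/output')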