 

Create Spark DataFrame from Pandas DataFrame

I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow.

import pandas as pd
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.printSchema()

Up to this point everything is OK. The output is:

root
 |-- Letters: string (nullable = true)

The problem comes when I try to print the DataFrame:

spark_df.show()

This is the result:

An error occurred while calling o158.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
Error executing Jupyter command 'pyspark.daemon': [Errno 2] No such file or directory PYTHONPATH was:
/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/jars/spark-core_2.11-2.4.0.jar:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/: org.apache.spark.SparkException: No port number in pyspark.daemon's stdout

These are my Spark specifications:

SparkSession - hive

SparkContext

Spark UI

Version: v2.4.0

Master: local[*]

AppName: PySparkShell

These are my environment variables:

export PYSPARK_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='lab'

Fact:

As the error mentions, it has to do with running PySpark from Jupyter: launching it with 'PYSPARK_PYTHON=python2.7' or 'PYSPARK_PYTHON=python3.6' instead works fine.
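
A minimal sketch of what probably needs to change (my reading of the error, not confirmed in the original post): PYSPARK_PYTHON tells Spark which executable to launch its Python workers with, so pointing it at the jupyter command makes Spark try to run 'pyspark.daemon' as a Jupyter subcommand, which fails. Jupyter belongs on the driver side instead:

# Workers need a plain Python interpreter (the exact version here is an example)
export PYSPARK_PYTHON=python3.6
# Run the driver through JupyterLab
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'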

asked Feb 14 '19 by roldanx


1 Answer

Import and initialise findspark, create a Spark session, and then use the session object to convert the pandas data frame to a Spark data frame. Then add the new Spark data frame to the catalog. Tested in both Jupyter 5.7.2 and Spyder 3.3.2 with Python 3.6.6.

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

# Create a spark session
spark = SparkSession.builder.getOrCreate()

# Create pandas data frame and convert it to a spark data frame 
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = spark.createDataFrame(pandas_df)

# Add the spark data frame to the catalog
spark_df.createOrReplaceTempView('spark_df')

spark_df.show()
+-------+
|Letters|
+-------+
|      X|
|      Y|
|      Z|
+-------+

spark.catalog.listTables()
Out[18]: [Table(name='spark_df', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
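
As a quick follow-up (my addition, not part of the original answer): registering the temp view is what makes the data frame queryable with plain SQL through the same session, for example:

spark.sql("SELECT * FROM spark_df WHERE Letters <> 'Y'").show()
+-------+
|Letters|
+-------+
|      X|
|      Z|
+-------+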
answered Sep 22 '22 by KRKirov