Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert a spark DataFrame to pandas DF

Is there a way to convert a Spark Df (not RDD) to pandas DF

I tried the following:

var some_df = Seq(
 ("A", "no"),
 ("B", "yes"),
 ("B", "yes"),
 ("B", "no")

 ).toDF(
"user_id", "phone_number")

Code:

%pyspark
pandas_df = some_df.toPandas()

Error:

 NameError: name 'some_df' is not defined

Any suggestions.

like image 297
data_person Avatar asked Jun 21 '18 00:06

data_person


People also ask

How you can convert a Spark DataFrame if DF to a pandas DataFrame?

Import the pandas library and create a Pandas Dataframe using the DataFrame() method. Create a spark session by importing the SparkSession from the pyspark library. Pass the Pandas dataframe to the createDataFrame() method of the SparkSession object. Print the DataFrame.

Which is faster pandas DF or Spark DF?

Spark DataFrame is distributed and hence processing in the Spark DataFrame is faster for a large amount of data. Pandas DataFrame is not distributed and hence processing in the Pandas DataFrame will be slower for a large amount of data.

Which method can be used to convert a Spark dataset to a DataFrame?

Converting Spark RDD to DataFrame can be done using toDF(), createDataFrame() and transforming rdd[Row] to the data frame.

Can we use pandas DataFrame in PySpark?

pandas-on-Spark DataFrame and Spark DataFrame are virtually interchangeable. However, note that a new default index is created when pandas-on-Spark DataFrame is created from Spark DataFrame. See Default Index Type. In order to avoid this overhead, specify the column to use as an index when possible.


2 Answers

following should work

Sample DataFrame

    some_df = sc.parallelize([
     ("A", "no"),
     ("B", "yes"),
     ("B", "yes"),
     ("B", "no")]
     ).toDF(["user_id", "phone_number"])

Converting DataFrame to Pandas DataFrame

    pandas_df = some_df.toPandas()
like image 124
Gaurang Shah Avatar answered Oct 08 '22 02:10

Gaurang Shah


In my case the following conversion from spark dataframe to pandas dataframe worked:

pandas_df = spark_df.select("*").toPandas()
like image 37
Inna Avatar answered Oct 08 '22 00:10

Inna