Is there a way to convert a Spark DataFrame (not an RDD) to a pandas DataFrame?
I tried the following (some_df was created in a Scala cell):
var some_df = Seq(
("A", "no"),
("B", "yes"),
("B", "yes"),
("B", "no")
).toDF(
"user_id", "phone_number")
Code:
%pyspark
pandas_df = some_df.toPandas()
Error:
NameError: name 'some_df' is not defined
Any suggestions?
Note that this describes the reverse direction (pandas to Spark): import the pandas library and create a pandas DataFrame with the DataFrame() constructor. Create a Spark session by importing SparkSession from the pyspark library. Pass the pandas DataFrame to the createDataFrame() method of the SparkSession object. Print the resulting DataFrame.
A Spark DataFrame is distributed, so processing large amounts of data is faster. A pandas DataFrame is not distributed, so processing large amounts of data will be slower.
Converting a Spark RDD to a DataFrame can be done using toDF(), createDataFrame(), or by transforming an RDD of Row objects into a DataFrame.
pandas-on-Spark DataFrame and Spark DataFrame are virtually interchangeable. However, note that a new default index is created when pandas-on-Spark DataFrame is created from Spark DataFrame. See Default Index Type. In order to avoid this overhead, specify the column to use as an index when possible.
The following should work. Your some_df was defined in Scala, so it is not visible to the %pyspark interpreter; recreate it in PySpark first.
Sample DataFrame
some_df = sc.parallelize([
("A", "no"),
("B", "yes"),
("B", "yes"),
("B", "no")]
).toDF(["user_id", "phone_number"])
Converting DataFrame to Pandas DataFrame
pandas_df = some_df.toPandas()
In my case, the following conversion from a Spark DataFrame to a pandas DataFrame worked:
pandas_df = spark_df.select("*").toPandas()