Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to convert Spark RDD to pandas dataframe in ipython?

I have a RDD and I want to convert it to pandas dataframe. I know that to convert and RDD to a normal dataframe we can do

df = rdd1.toDF()

But I want to convert the RDD to pandas dataframe and not a normal dataframe. How can I do it?

like image 497
user2966197 Avatar asked Jan 15 '16 18:01

user2966197


People also ask

How do you convert a Spark RDD into a DataFrame?

Converting Spark RDD to DataFrame can be done using toDF(), createDataFrame() and transforming rdd[Row] to the data frame.

Can we convert RDD to DataFrame in PySpark?

In PySpark, toDF() function of the RDD is used to convert RDD to DataFrame.

Can we create DataFrame using RDD?

This method can take an RDD and create a DataFrame from it. The createDataFrame is an overloaded method, and we can call the method by passing the RDD alone or with a schema. We can observe the column names are following a default sequence of names based on a default template.

How you can convert a Spark DataFrame if DF to a Pandas DataFrame?

Import the pandas library and create a Pandas Dataframe using the DataFrame() method. Create a spark session by importing the SparkSession from the pyspark library. Pass the Pandas dataframe to the createDataFrame() method of the SparkSession object. Print the DataFrame.


2 Answers

You can use function toPandas():

Returns the contents of this DataFrame as Pandas pandas.DataFrame.

This is only available if Pandas is installed and available.

>>> df.toPandas()  
   age   name
0    2  Alice
1    5    Bob
like image 195
jezrael Avatar answered Oct 22 '22 15:10

jezrael


You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.

For example, let's say I have a text file, flights.csv, that has been read in to an RDD:

flights = sc.textFile('flights.csv')

You can check the type:

type(flights)
<class 'pyspark.rdd.RDD'>

If you just use toPandas() on the RDD, it won't work. Depending on the format of the objects in your RDD, some processing may be necessary to go to a Spark DataFrame first. In the case of this example, this code does the job:

# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()

#Spark DataFrame to Pandas DataFrame
pdsDF = sparkDF.toPandas()

You can check the type:

type(pdsDF)
<class 'pandas.core.frame.DataFrame'>
like image 26
RKD314 Avatar answered Oct 22 '22 15:10

RKD314