 

Are PySpark and Pandas certified to work together? [closed]

I am facing a lot of issues integrating PySpark dataframes into existing Pandas code.

1) If I convert Pandas dataframes to PySpark dataframes, many operations do not translate well, since PySpark dataframes do not seem to be as feature-rich as Pandas dataframes.

2) If I choose to use PySpark dataframes and Pandas dataframes to handle different datasets within the same code, PySpark transformations (like map) do not seem to work at all when the function called through map touches any Pandas dataframes.
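
For illustration, here is a minimal sketch of the pattern that fails for me (the names lookup and tag are made up; my real code is larger):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# A driver-side Pandas dataframe referenced inside a map() closure.
lookup = pd.DataFrame({"k": ["foo", "bar"], "v": [1, 2]})

def tag(word):
    # Spark has to pickle `lookup` and ship it to every executor;
    # this is the step that breaks in my environment.
    return word, int(lookup.loc[lookup["k"] == word, "v"].iloc[0])

result = sc.parallelize(["foo", "bar"]).map(tag).collect()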

I have existing Python code that uses Pandas and NumPy and works fine on a single machine. My initial attempt to translate the entire code to Spark dataframes failed, since Spark dataframes do not support many operations that Pandas does.

Now I am trying to apply PySpark to the existing code to benefit from PySpark's distributed computation. I am using Spark 2.1.0 (Cloudera parcel) and the Anaconda distribution with Python 2.7.14.

Are PySpark and Pandas certified to work together? Are there any good references where I can find documentation and examples of using them together?

Your responses will be highly appreciated.

asked Dec 26 '17 by user8708009

1 Answer

I don't think PySpark is a replacement for Pandas. As per my understanding, I would pick:

  • PySpark when I want to do distributed computing on a huge data set; it may not have as many built-in functions as Pandas, since its main focus is distributed computing.
  • Pandas when the data is small enough to fit on one machine and I want to leverage its many built-in data-manipulation functions (a small illustration follows below).
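
As a rough illustration (toy data made up for this answer), here is an operation that is a one-liner in Pandas but would need window-function plumbing in PySpark:

import pandas as pd

# Rolling mean: a single chained call in Pandas; in PySpark this would
# require a Window spec with rowsBetween plus an aggregate function.
df = pd.DataFrame({"v": [1, 2, 3, 4]})
df["rolling_mean"] = df["v"].rolling(window=2).mean()
print(df)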

Edit (incorporating comments):

My challenge is that I have existing Pandas-based Python code that I want to run in a distributed way; hence the need to use Pandas within the PySpark framework.

PySpark and Pandas both call their data structure a 'dataframe', but they are different platforms at runtime.

The suggested approach is to rewrite the application from Pandas to PySpark. Any functionality that is not available in PySpark can be implemented as a UDF or UDAF.
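
For example, a minimal sketch of wrapping missing functionality in a UDF (the shout logic is invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("foo", 1), ("bar", 2)], ["k", "v"])

# Plain Python logic exposed to Spark as a column-level UDF.
shout = udf(lambda s: s.upper() + "!", StringType())
df.withColumn("k_loud", shout(df["k"])).show()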

Another alternative is to convert the Pandas dataframe to a PySpark one, but that is generally not recommended: the Pandas dataframe is not distributed, so it can become a bottleneck later.

Example (Pandas to PySpark):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pandas_df = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
spark_df = spark.createDataFrame(pandas_df)  # all of the Pandas data lives on the driver
answered Oct 04 '22 by mrsrinivas