I have a very large pyspark dataframe and a smaller pandas dataframe which I read in as follows:
df1 = spark.read.csv("/user/me/data1/")
df2 = pd.read_csv("data2.csv")
Both dataframes include columns labelled "A" and "B". I would like to create another pyspark dataframe containing only those rows of df1 whose entries in columns "A" and "B" also occur in the columns of the same name in df2. That is, I want to filter df1 using columns "A" and "B" of df2.
Normally I think this would be a join (implemented with merge), but how do you join a pandas dataframe with a pyspark one? I can't afford to convert df1 to a pandas dataframe.
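For reference, if both were pandas dataframes I would just write something like the following (a rough sketch of the merge I have in mind, keeping only the key columns of df2):

    # what I would do if df1 were also a pandas dataframe
    filtered = df1.merge(df2[["A", "B"]].drop_duplicates(), on=["A", "B"], how="inner")

but since df1 is a pyspark dataframe there is no merge method to call.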
You can either pass the schema explicitly when converting the pandas dataframe to a pyspark dataframe, like this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])

df = sqlContext.createDataFrame(pandas_dataframe, schema)
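Applied to the question above, a minimal sketch could look like this (assuming columns "A" and "B" are strings; adjust the types to match your data). The leftsemi join then keeps only those rows of df1 whose ("A", "B") pair appears in df2, without adding any columns from df2:

from pyspark.sql.types import StructType, StructField, StringType

# hypothetical schema for the two key columns; change the types as needed
key_schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", StringType(), True)])

spark_df2 = sqlContext.createDataFrame(df2[["A", "B"]].drop_duplicates(), key_schema)

# keep only the rows of df1 whose ("A", "B") pair occurs in df2
filtered = df1.join(spark_df2, on=["A", "B"], how="leftsemi")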
Or you can use the hack I have used in this function:
import numpy as np
import pandas as pd

def create_spark_dataframe(file_name):
    """
    Return a Spark dataframe built from the pandas dataframe read from file_name.
    """
    pandas_data_frame = pd.read_csv(file_name)
    for col in pandas_data_frame.columns:
        # replace NaN with '' in non-numeric columns so Spark's type
        # inference doesn't choke on mixed NaN/string columns
        if ((pandas_data_frame[col].dtypes != np.int64) &
                (pandas_data_frame[col].dtypes != np.float64)):
            pandas_data_frame[col] = pandas_data_frame[col].fillna('')
    spark_data_frame = sqlContext.createDataFrame(pandas_data_frame)
    return spark_data_frame
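For the case in the question, you could then build the small Spark dataframe from the CSV and filter df1 with the same leftsemi join as above (a sketch, reusing the file name from the question):

spark_df2 = create_spark_dataframe("data2.csv")
filtered = df1.join(spark_df2.select("A", "B").distinct(), on=["A", "B"], how="leftsemi")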