I have a 'big' dataset (huge_df) with >20 columns. One of the columns is an id field (generated with pyspark.sql.functions.monotonically_increasing_id()).
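For context, the id column was presumably attached along these lines (a minimal sketch; the withColumn call is an assumption, only the function name comes from the question):

from pyspark.sql.functions import monotonically_increasing_id

# hypothetical reconstruction: attach a unique (but not consecutive) 64-bit id to each row
huge_df = huge_df.withColumn('id', monotonically_increasing_id())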
Using some criteria I generate a second dataframe (filter_df), consisting of id values I want to filter out of huge_df later on.
Currently I am using SQL syntax to do this:
filter_df.createOrReplaceTempView('filter_view')
huge_df = huge_df.where('id NOT IN (SELECT id FROM filter_view)')
Question 1: Is there a way to do this using Python only, i.e. without the need to register the TempView?
Question 2: Is there a completely different way to accomplish the same thing?
You can use a JOIN:

huge_df = (
    huge_df.join(filter_df, huge_df.id == filter_df.id, 'left_outer')
    .where(filter_df.id.isNull())
    # reference columns via huge_df: after the join both frames contribute an 'id' column
    .select([huge_df[c] for c in huge_df.columns])
)
However, it will cause an expensive shuffle. The logic is simple: left-join with filter_df on the id field and check whether filter_df.id is null; a null means there is no matching row in filter_df, so the row is kept.
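If filter_df is small, a more direct alternative is a left anti join (available since Spark 2.0), which keeps exactly the rows of huge_df whose id has no match in filter_df. This is a sketch under two assumptions: the broadcast hint assumes filter_df fits in executor memory, and on='id' assumes both frames name the column id, as in the question:

from pyspark.sql.functions import broadcast

# left_anti keeps only left-side rows without a match;
# broadcasting the small frame avoids shuffling huge_df
huge_df = huge_df.join(broadcast(filter_df), on='id', how='left_anti')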