spark filter (delete) rows based on values from another dataframe [duplicate]

I have a 'big' dataset (huge_df) with >20 columns. One of the columns is an id field (generated with pyspark.sql.functions.monotonically_increasing_id()).

Using some criteria I generate a second dataframe (filter_df), consisting of id values I want to filter later on from huge_df.

Currently I am using SQL syntax to do this:

filter_df.createOrReplaceTempView('filter_view')
huge_df = huge_df.where('id NOT IN (SELECT id FROM filter_view)')

Question 1: Is there a way to do this using Python only, i.e. without the need to register the TempView?

Question 2: Is there a completely different way to accomplish the same thing?

asked Apr 19 '17 14:04 by akoeltringer
1 Answer

You can use a join:

from pyspark.sql.functions import col

huge_df = (huge_df.join(filter_df, huge_df.id == filter_df.id, "left_outer")
                  .where(filter_df.id.isNull())
                  .select([huge_df[c] for c in huge_df.columns]))

(Note: after the join both DataFrames contribute an id column, so the columns are selected via huge_df[c] rather than col(c) to avoid an ambiguous-reference error.)

However, it will cause an expensive shuffle.

The logic is simple: left-join with filter_df on the id field and keep only the rows where filter_df.id is null; a null there means no matching row exists in filter_df, so the row survives the filter.

answered Oct 14 '22 16:10 by T. Gawęda