I have a 'big' dataset (huge_df) with >20 columns. One of the columns is an id field (generated with pyspark.sql.functions.monotonically_increasing_id()).
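For context, the id column was presumably attached along these lines (a minimal sketch; the withColumn call is an assumption, only the function name comes from the question):

from pyspark.sql.functions import monotonically_increasing_id

# hypothetical reconstruction: attach a unique (but not consecutive) 64-bit id to each row
huge_df = huge_df.withColumn('id', monotonically_increasing_id())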
Using some criteria I generate a second dataframe (filter_df), consisting of id values I want to filter out of huge_df later on.
Currently I am using SQL syntax to do this:
filter_df.createOrReplaceTempView('filter_view')
huge_df = huge_df.where('id NOT IN (SELECT id FROM filter_view)')
Question 1: Is there a way to do this using Python only, i.e. without the need to register the TempView?
Question 2: Is there a completely different way to accomplish the same thing?
You can use a JOIN:

huge_df = (
    huge_df.join(filter_df, huge_df.id == filter_df.id, 'left_outer')
    .where(filter_df.id.isNull())
    # reference columns via huge_df: after the join both frames contribute an 'id' column
    .select([huge_df[c] for c in huge_df.columns])
)
However, it will cause an expensive shuffle. The logic is simple: left-join with filter_df on the id field and check whether filter_df.id is null; a null means there is no matching row in filter_df, so the row is kept.
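If filter_df is small, a more direct alternative is a left anti join (available since Spark 2.0), which keeps exactly the rows of huge_df whose id has no match in filter_df. This is a sketch under two assumptions: the broadcast hint assumes filter_df fits in executor memory, and on='id' assumes both frames name the column id, as in the question:

from pyspark.sql.functions import broadcast

# left_anti keeps only left-side rows without a match;
# broadcasting the small frame avoids shuffling huge_df
huge_df = huge_df.join(broadcast(filter_df), on='id', how='left_anti')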