
How can I use "where not exists" SQL condition in pyspark?

I have a Hive table and I am trying to insert data into it.
The data comes from SQL, but I don't want to insert rows whose id already exists in the Hive table. In other words, I want the equivalent of a where not exists condition. I am running PySpark on Airflow.

user3423920 asked Oct 21 '25 12:10



1 Answer

In the DataFrame API, the usual replacement for exists and not exists is a join. Two join types cover it: left_semi and left_anti.

For example, to insert a DataFrame df into a Hive table named target, first drop the rows whose id is already there:

# spark is the active SparkSession; keep only the rows of df whose id is not in target
new_df = df.join(
    spark.table("target"),
    how='left_anti',
    on='id'
)

Then write new_df to the table in append mode, so the rows already there are kept.

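left_anti keeps only the rows of df that have no match on the join condition, which is the equivalent of not exists. left_semi keeps only the rows that do have a match, which is the equivalent of exists. In both cases only the columns of df are returned.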

Steven answered Oct 23 '25 03:10



