I am working with two schema in spark, table1
and table2
:
scala> table1.printSchema
root
|-- user_id: long (nullable = true)
|-- item_id: long (nullable = true)
|-- value: double (nullable = true)
scala> table2.printSchema
root
|-- item_id: long (nullable = true)
|-- user_id: long (nullable = true)
|-- value: double (nullable = true)
However, I have created these two from different sources. Basically each of them is holding a value
information for (user_id
, item_id
) pair which is a floating point data type, and as such, prone to floating point errors. For example (1, 3, 4) in one table can be stored as (1, 3, 3.9998..) in another due to other calculations.
I need remove rows with (user_id
, item_id
) pair (guaranteed to be pair-wise unique) from table1
which are also present in table2
. Something like this:
scala> table1.except(table2)
However, there is no way to tell except when it should determine two rows to be same, which in this case is just (user_id
, item_id
). I need to disregard value
for this.
How to do this using spark-sql?
You can use drop() method in the DataFrame API to drop a particular column and then select all the columns.
In pyspark the drop() function can be used to remove values/columns from the dataframe. thresh – This takes an integer value and drops rows that have less than that thresh hold non-null values. By default it is set to 'None'.
You can select the single or multiple columns of the Spark DataFrame by passing the column names you wanted to select to the select() function. Since DataFrame is immutable, this creates a new DataFrame with a selected columns. show() function is used to show the DataFrame contents.
EXCEPT and EXCEPT ALL return the rows that are found in one relation but not the other. EXCEPT (alternatively, EXCEPT DISTINCT ) takes only distinct rows while EXCEPT ALL does not remove duplicates from the result rows.
Using a leftanti join
would be a possible solution. This will remove rows from the left table that are present in the right table for the given key.
table1.join(table2, Seq("user_id", "item_id"), "leftanti")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With