
How to implement `except` in Apache Spark based on subset of columns?

I am working with two DataFrames in Spark, table1 and table2:

scala> table1.printSchema
root
 |-- user_id: long (nullable = true)
 |-- item_id: long (nullable = true)
 |-- value: double (nullable = true)

scala> table2.printSchema
root
 |-- item_id: long (nullable = true)
 |-- user_id: long (nullable = true)
 |-- value: double (nullable = true)

However, I have created these two from different sources. Each of them holds a value for a (user_id, item_id) pair; the value is a floating-point type and, as such, prone to floating-point errors. For example, (1, 3, 4) in one table can be stored as (1, 3, 3.9998..) in the other due to other calculations.

I need to remove the rows from table1 whose (user_id, item_id) pair (guaranteed to be pairwise unique) is also present in table2. Something like this:

scala> table1.except(table2)

However, there is no way to tell except which columns it should use to decide whether two rows are the same, which in this case is just (user_id, item_id). I need it to disregard value.

How can I do this using spark-sql?

Zobayer Hasan asked Mar 14 '18

People also ask

How do I select all columns except one in Spark SQL?

You can use the drop() method in the DataFrame API to drop a particular column and then select all the remaining columns.

How do you exclude columns in PySpark?

In PySpark, the drop() function removes columns from a DataFrame, while dropna() removes rows. The thresh parameter of dropna() takes an integer and drops rows that have fewer than that threshold of non-null values; by default it is set to None.

How do I select certain columns in Spark DataFrame?

You can select one or more columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.

How does except work in Spark?

EXCEPT and EXCEPT ALL return the rows that are found in one relation but not the other. EXCEPT (alternatively, EXCEPT DISTINCT ) takes only distinct rows while EXCEPT ALL does not remove duplicates from the result rows.
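The distinction between EXCEPT and EXCEPT ALL can be illustrated outside Spark with a small plain-Python sketch (hypothetical helper names, not Spark API): EXCEPT behaves like a set difference, EXCEPT ALL like a multiset difference.

```python
from collections import Counter

def except_distinct(left, right):
    """EXCEPT (DISTINCT): distinct rows of `left` not present in `right`."""
    return sorted(set(left) - set(right))

def except_all(left, right):
    """EXCEPT ALL: multiset difference -- duplicates are removed per-occurrence."""
    return sorted((Counter(left) - Counter(right)).elements())

left = [(1, "a"), (1, "a"), (2, "b"), (3, "c")]
right = [(1, "a"), (3, "c")]

print(except_distinct(left, right))  # [(2, 'b')]
print(except_all(left, right))       # [(1, 'a'), (2, 'b')]
```

Note that both variants compare entire rows, which is exactly why they fail for the question above: a drifted value column makes two rows unequal even when their keys match.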


1 Answer

Using a leftanti join is a possible solution: it removes the rows from the left table whose key is present in the right table, and never compares the remaining columns.

table1.join(table2, Seq("user_id", "item_id"), "leftanti")
Shaido answered Oct 12 '22
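The left-anti-join semantics used above can be sketched in plain Python to show why it solves the floating-point problem: only the key tuple (user_id, item_id) is compared, so a drifted value column never matters. The helper name and sample data below are illustrative, not Spark API.

```python
def left_anti_join(left, right, key):
    """Keep rows from `left` whose key tuple does not appear in `right`."""
    right_keys = {tuple(row[k] for k in key) for row in right}
    return [row for row in left
            if tuple(row[k] for k in key) not in right_keys]

table1 = [
    {"user_id": 1, "item_id": 3, "value": 4.0},
    {"user_id": 1, "item_id": 5, "value": 2.5},
    {"user_id": 2, "item_id": 3, "value": 1.0},
]
table2 = [
    {"user_id": 1, "item_id": 3, "value": 3.9998},  # same key, drifted value
]

result = left_anti_join(table1, table2, key=("user_id", "item_id"))
print(result)  # rows with keys (1, 5) and (2, 3) survive; (1, 3) is dropped
```

This mirrors `table1.join(table2, Seq("user_id", "item_id"), "leftanti")`: because the join condition lists only the key columns, the value columns are disregarded exactly as the question requires.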