
How to implement `except` in Apache Spark based on subset of columns?

I am working with two DataFrames in Spark, table1 and table2:

scala> table1.printSchema
root
 |-- user_id: long (nullable = true)
 |-- item_id: long (nullable = true)
 |-- value: double (nullable = true)

scala> table2.printSchema
root
 |-- item_id: long (nullable = true)
 |-- user_id: long (nullable = true)
 |-- value: double (nullable = true)

However, I have created these two from different sources. Each of them holds a value for a (user_id, item_id) pair; the value is a floating-point type and, as such, prone to floating-point errors. For example, (1, 3, 4) in one table can be stored as (1, 3, 3.9998..) in the other due to other calculations.

I need to remove the rows from table1 whose (user_id, item_id) pair (guaranteed to be pairwise unique) is also present in table2. Something like this:

scala> table1.except(table2)

However, there is no way to tell except which columns it should use to decide whether two rows are the same, which in this case is just (user_id, item_id). I need it to disregard value.

How can I do this using spark-sql?

Zobayer Hasan asked Mar 14 '18

People also ask

How do I select all columns except one in Spark SQL?

You can use the drop() method in the DataFrame API to drop a particular column and then select all the remaining columns.

How do you exclude columns in PySpark?

In PySpark, the drop() function removes columns from a DataFrame, while dropna() removes rows. The thresh parameter of dropna() takes an integer and drops rows that have fewer than that threshold of non-null values; by default it is set to None.

How do I select certain columns in Spark DataFrame?

You can select one or more columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.

How does except work in Spark?

EXCEPT and EXCEPT ALL return the rows that are found in one relation but not the other. EXCEPT (alternatively, EXCEPT DISTINCT ) takes only distinct rows while EXCEPT ALL does not remove duplicates from the result rows.
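The distinction between EXCEPT and EXCEPT ALL can be illustrated outside Spark with a small plain-Python sketch (hypothetical helper names, not Spark API): EXCEPT behaves like a set difference, EXCEPT ALL like a multiset difference.

```python
from collections import Counter

def except_distinct(left, right):
    """EXCEPT (DISTINCT): distinct rows of `left` not present in `right`."""
    return sorted(set(left) - set(right))

def except_all(left, right):
    """EXCEPT ALL: multiset difference -- duplicates are removed per-occurrence."""
    return sorted((Counter(left) - Counter(right)).elements())

left = [(1, "a"), (1, "a"), (2, "b"), (3, "c")]
right = [(1, "a"), (3, "c")]

print(except_distinct(left, right))  # [(2, 'b')]
print(except_all(left, right))       # [(1, 'a'), (2, 'b')]
```

Note that both variants compare entire rows, which is exactly why they fail for the question above: a drifted value column makes two rows unequal even when their keys match.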


1 Answer

Using a leftanti join is a possible solution: it removes the rows from the left table whose key is present in the right table, and never compares the remaining columns.

table1.join(table2, Seq("user_id", "item_id"), "leftanti")
Shaido answered Oct 12 '22
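The left-anti-join semantics used above can be sketched in plain Python to show why it solves the floating-point problem: only the key tuple (user_id, item_id) is compared, so a drifted value column never matters. The helper name and sample data below are illustrative, not Spark API.

```python
def left_anti_join(left, right, key):
    """Keep rows from `left` whose key tuple does not appear in `right`."""
    right_keys = {tuple(row[k] for k in key) for row in right}
    return [row for row in left
            if tuple(row[k] for k in key) not in right_keys]

table1 = [
    {"user_id": 1, "item_id": 3, "value": 4.0},
    {"user_id": 1, "item_id": 5, "value": 2.5},
    {"user_id": 2, "item_id": 3, "value": 1.0},
]
table2 = [
    {"user_id": 1, "item_id": 3, "value": 3.9998},  # same key, drifted value
]

result = left_anti_join(table1, table2, key=("user_id", "item_id"))
print(result)  # rows with keys (1, 5) and (2, 3) survive; (1, 3) is dropped
```

This mirrors `table1.join(table2, Seq("user_id", "item_id"), "leftanti")`: because the join condition lists only the key columns, the value columns are disregarded exactly as the question requires.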