How to implement NOT IN for two DataFrames with different structure in Apache Spark

I am using Apache Spark in my Java application. I have two DataFrames: df1 and df2. df1 contains Rows with email, firstName and lastName; df2 contains Rows with only email.

I want to create a DataFrame df3 that contains all the rows from df1 whose email is not present in df2.

Is there a way to do this with Apache Spark? I tried converting df1 and df2 to JavaRDD&lt;String&gt; with toJavaRDD(), mapping each row to its email, and then using subtract, but I don't know how to map the resulting JavaRDD&lt;String&gt; back to the rows of df1 to get a DataFrame.

Basically, I need all the Rows in df1 whose email is not in df2.

DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM customer ");

DataFrame customersWhoOrderedTheProduct = sqlContext.cassandraSql("SELECT email FROM customer_bought_product " +
                            "WHERE product_id = '" + productId + "'");

JavaRDD<String> customersBoughtEmail = customersWhoOrderedTheProduct.toJavaRDD().map(row -> row.getString(0));

List<String> notBoughtEmails = customers.javaRDD()
                        .map(row -> row.getString(0))
                        .subtract(customersBoughtEmail).collect();
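
For reference, one way to finish that approach (a sketch, not from the original code: it assumes the collected notBoughtEmails list fits in driver memory, and Column.isin requires Spark 1.5+) is to filter the original DataFrame by the collected emails:

// Keep only the customers whose email survived the subtract.
// notBoughtEmails.toArray() feeds the varargs isin(Object...).
DataFrame customersWhoHaventOrdered = customers.filter(
        customers.col("email").isin(notBoughtEmails.toArray()));

For a large customer table this collect-and-filter approach won't scale, which is where the join-based answer below comes in.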
asked Nov 11 '15 by Ivan Stoyanov


1 Answer

Spark 2.0.0+

You can use NOT IN directly; Spark 2.0 added support for (NOT) IN predicate subqueries in Spark SQL.
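
Since the question uses the Java API, here is a minimal sketch of what that can look like in Spark 2.x Java (assuming a SparkSession named spark; the view name "ordered" and the variable names are illustrative, not from the original code):

// Spark 2.x Java API (org.apache.spark.sql.Dataset, Row).
// Option 1: NOT IN as a SQL predicate subquery.
customers.createOrReplaceTempView("customers");
customersWhoOrderedTheProduct.createOrReplaceTempView("ordered");

Dataset<Row> df3 = spark.sql(
    "SELECT * FROM customers " +
    "WHERE email NOT IN (SELECT email FROM ordered)");

// Option 2: the same result without SQL. A left anti join keeps only
// the rows on the left that have no match on the right.
Dataset<Row> df3ViaAntiJoin = customers.join(
    customersWhoOrderedTheProduct,
    customers.col("email").equalTo(customersWhoOrderedTheProduct.col("email")),
    "left_anti");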

Spark < 2.0.0

It can be expressed with an outer join and a filter.

// In spark-shell; otherwise import sqlContext.implicits._ for toDF and $.
val customers = sc.parallelize(Seq(
  ("john@example.com", "John", "Doe"),
  ("jane@example.com", "Jane", "Doe")
)).toDF("email", "first_name", "last_name")

val customersWhoOrderedTheProduct = sc.parallelize(Seq(
  Tuple1("jane@example.com")
)).toDF("email")

val customersWhoHaventOrderedTheProduct = customers.join(
    customersWhoOrderedTheProduct.select($"email".alias("email_")),
    $"email" === $"email_", "leftouter")
 .where($"email_".isNull).drop("email_")

customersWhoHaventOrderedTheProduct.show

// +----------------+----------+---------+
// |           email|first_name|last_name|
// +----------------+----------+---------+
// |john@example.com|      John|      Doe|
// +----------------+----------+---------+

Raw SQL equivalent:

customers.registerTempTable("customers")
customersWhoOrderedTheProduct.registerTempTable(
  "customersWhoOrderedTheProduct")

val query = """SELECT c.* FROM customers c LEFT OUTER JOIN  
                 customersWhoOrderedTheProduct o
               ON c.email = o.email
               WHERE o.email IS NULL"""

sqlContext.sql(query).show

// +----------------+----------+---------+
// |           email|first_name|last_name|
// +----------------+----------+---------+
// |john@example.com|      John|      Doe|
// +----------------+----------+---------+
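
For the Java API in the question, a rough translation of the same outer-join-and-filter approach (a sketch against the Spark 1.x DataFrame API, reusing the customers and customersWhoOrderedTheProduct variables from the question):

// Alias the right-hand email so the filter can tell the columns apart.
DataFrame ordered = customersWhoOrderedTheProduct
        .select(customersWhoOrderedTheProduct.col("email").alias("email_"));

DataFrame customersWhoHaventOrderedTheProduct = customers
        .join(ordered, customers.col("email").equalTo(ordered.col("email_")), "leftouter")
        .where(ordered.col("email_").isNull())
        .drop("email_");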
answered by zero323