 

Spark: subtract two DataFrames

In Spark version 1.2.0, one could use subtract with two SchemaRDDs to end up with only the content that exists in the first one but not the second:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD) 

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?

Interfector asked Apr 09 '15

People also ask

Can we subtract two DataFrames in PySpark?

Pretty simple: use except() to subtract, i.e. find the difference between two DataFrames.

What is subtract in PySpark?

subtract() is applied on two RDDs and returns the elements present in the first RDD but not in the second. distinct() is applied on a single RDD and returns only its unique elements.
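The difference between the two operations can be sketched with plain Scala collections (no Spark required). The helper names subtractLike and distinctLike are illustrative, not Spark API:

```scala
// Illustrative sketch (plain Scala, no Spark): subtract keeps every element
// of the left collection whose value does not appear in the right, while
// distinct collapses duplicates within a single collection.
object RddSetOpsSketch {
  // Analogue of rdd1.subtract(rdd2): duplicates on the left are preserved
  def subtractLike[A](left: Seq[A], right: Seq[A]): Seq[A] =
    left.filterNot(right.toSet)

  // Analogue of rdd.distinct(): unique elements only
  def distinctLike[A](xs: Seq[A]): Seq[A] =
    xs.distinct

  def main(args: Array[String]): Unit = {
    println(subtractLike(Seq(1, 1, 2, 3), Seq(2)))  // List(1, 1, 3)
    println(distinctLike(Seq(1, 1, 2, 3)))          // List(1, 2, 3)
  }
}
```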

Can we use minus in spark SQL?

Spark SQL supports three types of set operators: EXCEPT (or MINUS), INTERSECT, and UNION.


1 Answer

According to the Scala API docs, doing:

dataFrame1.except(dataFrame2) 

will return a new DataFrame containing the rows in dataFrame1 that are not in dataFrame2.
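As a sketch of those semantics with plain Scala collections (no Spark session needed): note that, like SQL EXCEPT, except also de-duplicates its result, an assumption worth checking against your Spark version. The exceptLike helper below is illustrative, not Spark API:

```scala
// Illustrative sketch (plain Scala, no Spark): DataFrame.except returns the
// distinct rows of the left side that do not occur on the right, i.e.
// SQL's EXCEPT DISTINCT semantics.
object ExceptSketch {
  def exceptLike[A](left: Seq[A], right: Seq[A]): Seq[A] =
    left.distinct.filterNot(right.toSet)

  def main(args: Array[String]): Unit = {
    val yesterday = Seq("a", "b", "c")
    val today     = Seq("b", "c", "d", "d")
    // "d" is the only new value; the duplicate is collapsed
    println(exceptLike(today, yesterday))  // List(d)
  }
}
```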

Eric Eijkelenboom answered Sep 19 '22