 

Spark: subtract two DataFrames

In Spark version 1.2.0, one could use subtract with two SchemaRDDs to end up with only the content that exists in the first one but not the second:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD) 

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?

Interfector asked Apr 09 '15

People also ask

Can we subtract two DataFrames in PySpark?

Pretty simple: use except() to subtract, i.e. find the difference between two DataFrames.

What is subtract in PySpark?

subtract() is applied on two RDDs and returns the elements present in the first RDD but not in the second. distinct() is applied on a single RDD and returns only its unique elements.
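The difference between the two operations can be sketched with plain Scala collections (no Spark required). The helper names subtractLike and distinctLike are illustrative, not Spark API:

```scala
// Illustrative sketch (plain Scala, no Spark): subtract keeps every element
// of the left collection whose value does not appear in the right, while
// distinct collapses duplicates within a single collection.
object RddSetOpsSketch {
  // Analogue of rdd1.subtract(rdd2): duplicates on the left are preserved
  def subtractLike[A](left: Seq[A], right: Seq[A]): Seq[A] =
    left.filterNot(right.toSet)

  // Analogue of rdd.distinct(): unique elements only
  def distinctLike[A](xs: Seq[A]): Seq[A] =
    xs.distinct

  def main(args: Array[String]): Unit = {
    println(subtractLike(Seq(1, 1, 2, 3), Seq(2)))  // List(1, 1, 3)
    println(distinctLike(Seq(1, 1, 2, 3)))          // List(1, 2, 3)
  }
}
```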

Can we use minus in spark SQL?

Spark SQL supports three types of set operators: EXCEPT (or MINUS), INTERSECT, and UNION.


1 Answer

According to the Scala API docs, doing:

dataFrame1.except(dataFrame2) 

will return a new DataFrame containing the rows in dataFrame1 that are not in dataFrame2.
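As a sketch of those semantics with plain Scala collections (no Spark session needed): note that, like SQL EXCEPT, except also de-duplicates its result, an assumption worth checking against your Spark version. The exceptLike helper below is illustrative, not Spark API:

```scala
// Illustrative sketch (plain Scala, no Spark): DataFrame.except returns the
// distinct rows of the left side that do not occur on the right, i.e.
// SQL's EXCEPT DISTINCT semantics.
object ExceptSketch {
  def exceptLike[A](left: Seq[A], right: Seq[A]): Seq[A] =
    left.distinct.filterNot(right.toSet)

  def main(args: Array[String]): Unit = {
    val yesterday = Seq("a", "b", "c")
    val today     = Seq("b", "c", "d", "d")
    // "d" is the only new value; the duplicate is collapsed
    println(exceptLike(today, yesterday))  // List(d)
  }
}
```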

Eric Eijkelenboom answered Sep 19 '22