DataFrame equality in Apache Spark

Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API.

Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows & columns?

The motivation for the question is that there are often many ways to compute some big data result, each with its own trade-offs. As one explores these trade-offs, it is important to maintain correctness, hence the need to check for equivalence/equality on a meaningful test data set.

940

asked Jul 03 '15 02:07

Sim

1 Answer

Scala (see below for PySpark)

The spark-fast-tests library has two methods for making DataFrame comparisons (I'm the creator of the library):

The assertSmallDataFrameEquality method collects both DataFrames on the driver node and compares them there:

def assertSmallDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  if (!actualDF.collect().sameElements(expectedDF.collect())) {
    throw new DataFrameContentMismatch(contentMismatchMessage(actualDF, expectedDF))
  }
}

The assertLargeDataFrameEquality method compares DataFrames spread across multiple machines (the code is basically copied from spark-testing-base):

def assertLargeDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  try {
    actualDF.rdd.cache
    expectedDF.rdd.cache

    val actualCount = actualDF.rdd.count
    val expectedCount = expectedDF.rdd.count
    if (actualCount != expectedCount) {
      throw new DataFrameContentMismatch(countMismatchMessage(actualCount, expectedCount))
    }

    val expectedIndexValue = zipWithIndex(actualDF.rdd)
    val resultIndexValue = zipWithIndex(expectedDF.rdd)

    val unequalRDD = expectedIndexValue
      .join(resultIndexValue)
      .filter {
        case (idx, (r1, r2)) =>
          !(r1.equals(r2) || RowComparer.areRowsEqual(r1, r2, 0.0))
      }

    val maxUnequalRowsToShow = 10
    assertEmpty(unequalRDD.take(maxUnequalRowsToShow))

  } finally {
    actualDF.rdd.unpersist()
    expectedDF.rdd.unpersist()
  }
}

assertSmallDataFrameEquality is faster for small DataFrame comparisons and I've found it sufficient for my test suites.
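To make the index-and-join idea in assertLargeDataFrameEquality easier to follow, here is a rough plain-Python sketch of the same logic applied to rows that are already in memory. It is only an illustration, not the library's code: the function name unequal_rows and the max_to_show parameter are made up for this example, and rows are assumed to be plain tuples rather than Spark Row objects.

```python
def unequal_rows(actual_rows, expected_rows, max_to_show=10):
    """Pair rows by position and return up to max_to_show mismatching pairs.

    Hypothetical helper for illustration; rows are assumed to be tuples.
    """
    if len(actual_rows) != len(expected_rows):
        raise AssertionError(
            f"row counts differ: {len(actual_rows)} != {len(expected_rows)}"
        )
    # Key each row by its position, similar in spirit to indexing each RDD.
    actual_by_index = dict(enumerate(actual_rows))
    expected_by_index = dict(enumerate(expected_rows))
    # Join on the index and keep only the pairs whose rows differ.
    mismatches = [
        (idx, actual_by_index[idx], expected_by_index[idx])
        for idx in sorted(actual_by_index.keys() & expected_by_index.keys())
        if actual_by_index[idx] != expected_by_index[idx]
    ]
    return mismatches[:max_to_show]
```

An empty result means the two row sequences match position by position. Like the Scala methods above, this comparison is sensitive to row order.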

PySpark

Here's a simple function that returns True if the DataFrames are equal:

def are_dfs_equal(df1, df2):
    if df1.schema != df2.schema:
        return False
    if df1.collect() != df2.collect():
        return False
    return True

You'll typically perform DataFrame equality comparisons in a test suite and will want a descriptive error message when the comparisons fail (a True / False return value doesn't help much when debugging).

Use the chispa library's assert_df_equality method, which raises an error with a descriptive message when the DataFrames differ, making it a better fit for test suite workflows.
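Note that are_dfs_equal compares schemas and collected rows in order, while the question asks for equality regardless of row and column ordering. One possible way to relax this for small DataFrames is to compare the collected rows as multisets. The sketch below is an assumption-laden illustration rather than an established API: the function name rows_equal_ignoring_order is made up, rows are assumed to be dicts mapping column name to value (for example, obtained with row.asDict() on each collected Row), and all values are assumed to be hashable.

```python
from collections import Counter


def rows_equal_ignoring_order(rows1, rows2):
    """Compare two collections of rows as multisets.

    Hypothetical helper: each row is a dict of column name -> value.
    Row order and column order are ignored; duplicate rows are counted.
    Values must be hashable (so array or map columns would need converting).
    """
    def canonical(row):
        # Sort by column name so that column order does not matter.
        return tuple(sorted(row.items(), key=lambda kv: kv[0]))

    # Counter keeps multiplicities, so repeated rows must match in count too.
    return Counter(canonical(r) for r in rows1) == Counter(canonical(r) for r in rows2)
```

With PySpark this could be called as rows_equal_ignoring_order([r.asDict() for r in df1.collect()], [r.asDict() for r in df2.collect()]). As far as I know, chispa's assert_df_equality also accepts ignore_row_order and ignore_column_order flags, which is likely the better choice inside a test suite.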

144

answered Oct 06 '22 02:10

Powers

Tags: dataframe, scala, apache-spark, rdd, apache-spark-sql
