
How to write Scala unit tests to compare Spark DataFrames?

Purpose - Checking whether a DataFrame generated by Spark and a manually created DataFrame are the same.

Earlier implementation which worked -

if (da.except(ds).count() != 0 && ds.except(da).count != 0)

Boolean returned - true

Here da and ds are the generated DataFrame and the manually created DataFrame, respectively.

Here I am running the program via the spark-shell.

Newer implementation which doesn't work -

assert (da.except(ds).count() != 0 && ds.except(da).count != 0)

Boolean returned - false

da and ds are the same DataFrames as above.

Here I am using the assert method of ScalaTest instead, but the condition does not evaluate to true.

Why use the new implementation when the previous one worked? So that sbt uses ScalaTest to run the test file every time, via sbt test or while compiling.

The same code for comparing Spark DataFrames gives the correct output when run via the spark-shell, but shows an error when run using ScalaTest in sbt.

The two programs are effectively the same, but the results are different. What could be the problem?

Pratyush Das asked Mar 28 '26 23:03


1 Answer

Tests that compare DataFrames exist in Spark's own test suite, for example: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala

Libraries containing the shared test code (SharedSQLContext, etc.) are available in the Maven Central repository. You can include them in your project and use the "checkAnswer" methods to compare DataFrames.
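
If pulling in Spark's test artifacts is not an option, a hand-rolled comparison in the spirit of the question can also be asserted from ScalaTest. Note that the condition in the question (both except counts != 0) only holds when each DataFrame has rows missing from the other, so it is false for identical DataFrames. An equality check needs both differences to be empty. The following is only a sketch under assumptions: the names DataFrameComparison, sameRows and DataFrameComparisonSuite and the sample data are made up for illustration, spark-sql and ScalaTest 3.1+ (for AnyFunSuite) are assumed to be on the test classpath, and this is not Spark's checkAnswer. Also, except has set semantics, so differences in duplicate row counts go unnoticed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical helper, not part of Spark or ScalaTest.
object DataFrameComparison {
  // True when neither DataFrame contains a distinct row that the other lacks.
  // except() resolves columns by position and ignores duplicate counts.
  def sameRows(a: DataFrame, b: DataFrame): Boolean =
    a.except(b).count() == 0 && b.except(a).count() == 0
}

// Hypothetical suite; sbt test picks it up when placed under src/test/scala.
class DataFrameComparisonSuite extends AnyFunSuite {

  // Assumption: a local single-threaded session is enough for unit tests.
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("dataframe-comparison-test")
    .getOrCreate()

  test("generated and manually created DataFrames contain the same rows") {
    val session = spark
    import session.implicits._

    // Stand-ins for the da and ds of the question; row order differs on purpose.
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val ds = Seq((2, "b"), (1, "a")).toDF("id", "value")

    assert(DataFrameComparison.sameRows(da, ds))
  }

  test("DataFrames with different rows are reported as different") {
    val session = spark
    import session.implicits._

    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val other = Seq((1, "a"), (3, "c")).toDF("id", "value")

    assert(!DataFrameComparison.sameRows(da, other))
  }
}
```

Keeping the comparison in a small function means the assertion itself stays a single boolean, which is what ScalaTest's assert expects. If duplicate rows matter, the comparison would need something other than except.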

pasha701 answered Apr 02 '26 15:04


