Purpose: check whether a DataFrame generated by Spark and a manually created DataFrame are the same.
Earlier implementation, which worked:
if (da.except(ds).count() != 0 && ds.except(da).count != 0)
Boolean returned - true
Where da and ds are the generated dataframe and the created dataframe respectively.
Here I am running the program via the spark-shell.
Newer implementation, which doesn't work:
assert (da.except(ds).count() != 0 && ds.except(da).count != 0)
Boolean returned - false
Where da and ds are the generated dataframe and the created dataframe respectively.
Here I am using ScalaTest's assert method instead, but the result does not come back as true.
Why use the new implementation when the previous method worked? So that sbt uses ScalaTest to run the test file every time, via sbt test or while compiling.
The same code for comparing Spark DataFrames gives the correct output when run via the spark-shell, but shows an error when run using ScalaTest in sbt.
The two programs are effectively the same but the results are different. What could be the problem?
Tests that compare DataFrames exist in Spark's own sql/core test sources, for example: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
Libraries with the shared test code (SharedSQLContext, etc.) are available in the central Maven repository; you can include them in your project and use the "checkAnswer" methods to compare DataFrames.
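If pulling in Spark's test jars is not an option, an except-based check can also be written directly in ScalaTest. Note that the condition in the question, `da.except(ds).count() != 0 && ds.except(da).count != 0`, is true only when each DataFrame has rows the other lacks, so for two equal DataFrames it evaluates to false and assert fails; an equality check would compare both counts with `== 0`. The following is only a sketch under assumptions: it assumes ScalaTest 3.1+ (for `AnyFunSuite`) and spark-sql on the test classpath, and the names `DataFrameChecks`, `sameRows` and `DataFrameEqualitySuite` are hypothetical, not part of Spark or ScalaTest.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical helper, not a Spark API.
object DataFrameChecks {
  // True when neither DataFrame contains a row that the other lacks.
  // except is a set difference, so duplicate rows are not counted.
  def sameRows(a: DataFrame, b: DataFrame): Boolean =
    a.except(b).count() == 0 && b.except(a).count() == 0
}

// Hypothetical suite name; assumes ScalaTest 3.1+.
class DataFrameEqualitySuite extends AnyFunSuite {

  // Local session used only by this test (assumption: no shared test session exists).
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("dataframe-equality-test")
    .getOrCreate()

  test("generated DataFrame matches the manually created one") {
    val session = spark
    import session.implicits._

    // Stand-ins for the da and ds DataFrames from the question.
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val ds = Seq((2, "b"), (1, "a")).toDF("id", "value")

    assert(DataFrameChecks.sameRows(da, ds))
  }
}
```

Because except ignores duplicates, two DataFrames that differ only in how often a row repeats would still pass this check; on Spark 2.4 or later, `exceptAll` could be substituted if duplicate counts matter.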