I have two PySpark DataFrames, expected_df and actual_df, as shown in the attached file.
In my unit test I am trying to check whether both are equal.
My code for this is:
expected = list(map(lambda row: row.asDict(), expected_df.collect()))
actual = list(map(lambda row: row.asDict(), actual_df.collect()))
assert expected == actual
Both DataFrames hold the same data, but the row order differs, so the assert fails here. What is the best way to compare such DataFrames?
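One way to make the comparison order-insensitive is to normalize both lists of row dicts before asserting. This is a minimal sketch using plain dicts to stand in for the output of row.asDict(); the normalize helper and the sample data are illustrative, not from the original post:

```python
# Plain dicts stand in for the dicts produced by row.asDict().
expected = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
actual = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

def normalize(rows):
    # Sort rows by all (column, value) pairs so row order no longer matters.
    return sorted(rows, key=lambda r: tuple(sorted(r.items())))

assert normalize(expected) == normalize(actual)
```

This keeps the original collect-and-compare approach intact and only removes its sensitivity to row ordering.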
The assert keyword is used when debugging code. It lets you test whether a condition in your code returns True; if not, the program raises an AssertionError.
Convert PySpark DataFrame to Pandas DataFrame
PySpark DataFrame provides a method, toPandas(), to convert it to a Python Pandas DataFrame. toPandas() collects all records of the PySpark DataFrame to the driver program, so it should be done only on a small subset of the data.
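If the data is small enough for toPandas(), the comparison can be finished on the pandas side. A minimal sketch, where expected_pdf and actual_pdf stand in for expected_df.toPandas() and actual_df.toPandas(), and assert_same_rows is a hypothetical helper name:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Stand-ins for expected_df.toPandas() and actual_df.toPandas().
expected_pdf = pd.DataFrame({"id": [2, 1], "name": ["b", "a"]})
actual_pdf = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

def assert_same_rows(left, right):
    # Sort by all columns and reset the index so row order is ignored,
    # then let pandas do the strict value/dtype comparison.
    cols = sorted(left.columns)
    left = left.sort_values(cols).reset_index(drop=True)[cols]
    right = right.sort_values(cols).reset_index(drop=True)[cols]
    assert_frame_equal(left, right)

assert_same_rows(expected_pdf, actual_pdf)
```

Note that assert_frame_equal also checks dtypes, which collect()-based comparisons do not.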
You can try pyspark-test
https://pypi.org/project/pyspark-test/
It is inspired by the pandas testing module and built for PySpark.
Usage is simple:
from pyspark_test import assert_pyspark_df_equal
assert_pyspark_df_equal(df_1, df_2)
Also, apart from just comparing DataFrames, like the pandas testing module it accepts many optional parameters, which you can check in its documentation.
Note: using .toPandas and the pandas testing module might not be the right approach. This is done in some of the PySpark documentation:
assert sorted(expected_df.collect()) == sorted(actual_df.collect())
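One caveat with the sorted() approach: Rows compare like tuples, and in Python 3 sorting can raise a TypeError when a column mixes None with other values. A multiset comparison sidesteps this. A small sketch, with plain tuples standing in for collected Rows:

```python
from collections import Counter

# Stand-ins for Row objects returned by df.collect(); tuples are hashable,
# so Counter can compare the two result sets as multisets without sorting.
expected_rows = [(1, "a"), (2, None)]
actual_rows = [(2, None), (1, "a")]

assert Counter(expected_rows) == Counter(actual_rows)
```

This also correctly handles duplicate rows, which a set-based comparison would collapse.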