pytest assert for pyspark dataframe comparison

I have two PySpark DataFrames, expected_df and actual_df, shown in the attached image.

[Image: expected_df and actual_df]

In my unit test I am trying to check whether the two are equal. My code is:

expected = [row.asDict() for row in expected_df.collect()]
actual = [row.asDict() for row in actual_df.collect()]
assert expected == actual

Both DataFrames contain the same rows, but in a different order, so the assert fails. What is the best way to compare such DataFrames?

Bharat Sharma asked Oct 03 '18 02:10


2 Answers

You can try pyspark-test:

https://pypi.org/project/pyspark-test/

It is a testing package for PySpark, inspired by the pandas testing module.

Usage is simple:

from pyspark_test import assert_pyspark_df_equal

assert_pyspark_df_equal(df_1, df_2)

Besides simply comparing DataFrames, it accepts, like the pandas testing module, several optional parameters, which you can check in the documentation.

Note:

  1. The data types in pandas and PySpark are a bit different, which is why converting directly with .toPandas() and using the pandas testing module might not be the right approach.
  2. This package is for unit/integration testing, so it is meant to be used with small DataFrames.
Rahul Kumar answered Oct 04 '22 22:10


Some of the PySpark documentation does this:

assert sorted(expected_df.collect()) == sorted(actual_df.collect())
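
One caveat with sorting collected rows: in Python 3 the sort can raise a TypeError if it ends up comparing a None with a non-None value in the same column. Here is a minimal sketch of an alternative that compares rows as multisets, so no ordering is needed. The helper names rows_as_multiset and assert_rows_equal_unordered are hypothetical (they come from no library), and the sketch assumes every column value is hashable. Plain dicts stand in for the output of row.asDict() so that it runs without Spark.

```python
from collections import Counter


def rows_as_multiset(rows):
    """Turn an iterable of dict-like rows into a Counter of hashable row keys."""
    # Sorting the items by column name makes the key independent of the
    # order in which columns appear inside each row.
    return Counter(tuple(sorted(dict(row).items())) for row in rows)


def assert_rows_equal_unordered(expected_rows, actual_rows):
    """Assert that two collections hold the same rows, ignoring row order."""
    expected = rows_as_multiset(expected_rows)
    actual = rows_as_multiset(actual_rows)
    missing = expected - actual      # rows expected but not produced
    unexpected = actual - expected   # rows produced but not expected
    assert not missing and not unexpected, (
        f"missing rows: {dict(missing)}; unexpected rows: {dict(unexpected)}"
    )


# Stand-in data; with Spark you would pass
# [row.asDict() for row in expected_df.collect()] and the same for actual_df.
expected_rows = [{"id": 1, "name": None}, {"id": 2, "name": "b"}]
actual_rows = [{"name": "b", "id": 2}, {"id": 1, "name": None}]
assert_rows_equal_unordered(expected_rows, actual_rows)
```

Because a Counter keeps track of how many times each row occurs, duplicate rows are counted rather than collapsed, which a set-based comparison would miss. Like collect(), this pulls all rows to the driver, so it is only suitable for small test DataFrames.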

Luis Meraz answered Oct 04 '22 23:10