I'm new to PySpark, So apoloigies if this is a little simple, I have found other questions that compare dataframes but not one that is like this, therefore I do not consider it to be a duplicate. I'm trying to compare two dateframes with similar structure. The 'name' will be unique, yet the counts could be different.
So if the count is different I would like it to produce a dataframe or a python dictionary. just like below. Any ideas on how I would achieved something like this?
DF1
+-------+---------+
|name | count_1 |
+-------+---------+
| Alice| 1500 |
| Bob| 1000 |
|Charlie| 150 |
| Dexter| 100 |
+-------+---------+
DF2
+-------+---------+
|name | count_2 |
+-------+---------+
| Alice| 1500 |
| Bob| 200 |
|Charlie| 150 |
| Dexter| 10 |
+-------+---------+
To produce the outcome:
Mismatch
+-------+-------------+--------------+
|name | df1_count | df2_count |
+-------+-------------+--------------+
| Bob| 1000 | 200 |
| Dexter| 100 | 10 |
+-------+-------------+--------------+
Match
+-------+-------------+--------------+
|name | df1_count | df2_count |
+-------+-------------+--------------+
| Alice| 1500 | 1500 |
|Charlie| 150 | 150 |
+-------+-------------+--------------+
So I create a third DataFrame, joining DataFrame1 and DataFrame2, and then filter by the counts fields to check if they are equal or not:
Mismatch:
df3 = df1.join(df2, [df1.name == df2.name] , how = 'inner' )
df3.filter(df3.df1_count != df3.df2_count).show()
Match:
df3 = df1.join(df2, [df1.name == df2.name] , how = 'inner' )
df3.filter(df3.df1_count == df3.df2_count).show()
Hope this comes in useful for someone
For small DataFrame comparisons, you can use the chispa library. This is particularly useful when performing DataFrame comparisons in a test suite. For big datasets, the accepted answer that uses a join is the best approach.
In this example, chispa.assert_df_equality(df1, df2)
, will output this error message:
The rows that mismatch are red and the rows that match are blue. This post has more info on testing PySpark code.
There's a cool library called deequ that is good for "data unit tests", but I'm not sure if there is a PySpark implementation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With