PySpark

Question

I'm new to PySpark, So apoloigies if this is a little simple, I have found other questions that compare dataframes but not one that is like this, therefore I do not consider it to be a duplicate. I'm trying to compare two dateframes with similar structure. The 'name' will be unique, yet the counts could be different.

So if the count is different I would like it to produce a dataframe or a python dictionary. just like below. Any ideas on how I would achieved something like this?

DF1

+-------+---------+
|name   | count_1 |
+-------+---------+
|  Alice|   1500  |
|    Bob|   1000  |
|Charlie|   150   |
| Dexter|   100   |
+-------+---------+

DF2

+-------+---------+
|name   | count_2 |
+-------+---------+
|  Alice|   1500  |
|    Bob|   200   |
|Charlie|   150   |
| Dexter|   10    |
+-------+---------+

To produce the outcome:

Mismatch

+-------+-------------+--------------+
|name   | df1_count   | df2_count    |
+-------+-------------+--------------+
|    Bob|   1000      |    200       |
| Dexter|   100       |     10       |
+-------+-------------+--------------+

Match

+-------+-------------+--------------+
|name   | df1_count   | df2_count    |
+-------+-------------+--------------+
|  Alice|   1500      |   1500       |
|Charlie|   150       |    150       |
+-------+-------------+--------------+

user10176078 · Accepted Answer

So I create a third DataFrame, joining DataFrame1 and DataFrame2, and then filter by the counts fields to check if they are equal or not:

Mismatch:

df3 = df1.join(df2, [df1.name == df2.name] , how = 'inner' )
df3.filter(df3.df1_count != df3.df2_count).show()

Match:

df3 = df1.join(df2, [df1.name == df2.name] , how = 'inner' )
df3.filter(df3.df1_count == df3.df2_count).show()

Hope this comes in useful for someone

Powers · Answer

For small DataFrame comparisons, you can use the chispa library. This is particularly useful when performing DataFrame comparisons in a test suite. For big datasets, the accepted answer that uses a join is the best approach.

In this example, chispa.assert_df_equality(df1, df2), will output this error message:

enter image description here

The rows that mismatch are red and the rows that match are blue. This post has more info on testing PySpark code.

There's a cool library called deequ that is good for "data unit tests", but I'm not sure if there is a PySpark implementation.

PySpark - Compare DataFrames

Tags:

python

dataframe

apache-spark

apache-spark-sql

user10176078

2 Answers

user10176078

Powers

Recent Activity

Donate For Us

PySpark - Compare DataFrames

Tags:

python

dataframe

apache-spark

apache-spark-sql

pyspark

user10176078

2 Answers

user10176078

Powers

Related questions

Recent Activity

Donate For Us