Compare two Spark dataframes

Spark DataFrame 1:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |236 |431|169    |
|city 2|prod 1 |9/28/2017|358 |975|193    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
+------+-------+---------+----+---+-------+

Spark DataFrame 2:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |230 |430|160    |
|city 1|prod 4 |9/27/2017|350 |90 |190    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
|city 3|prod 4 |9/18/2017|230 |431|169    |
+------+-------+---------+----+---+-------+

Given Spark DataFrame 1 and Spark DataFrame 2 above, I need to produce a separate DataFrame for each of the following:

Deleted Records
New Records
Records with no changes
Records with changes

The comparison key is 'city', 'product', 'date'.

I need a solution that does not use Spark SQL.

asked Aug 07 '17 18:08

prakash

2 Answers

I am not sure about finding the deleted and modified records, but you can use the except function to get the difference:

df2.except(df1)

This returns the rows of df2 that were added or modified relative to df1, that is, the new records together with the records with changes. Output:

+------+-------+---------+----+---+-------+
|  city|product|     date|sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 3| prod 4|9/18/2017| 230|431|    169|
|city 1| prod 4|9/27/2017| 350| 90|    190|
|city 1| prod 3|9/9/2017 | 230|430|    160|
+------+-------+---------+----+---+-------+
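The except result mixes new and changed rows. One possible way to separate them, and to cover the deleted and unchanged cases as well, is to combine except and intersect with anti and semi joins on the key columns. This is only a sketch that assumes the key is unique within each DataFrame; the names keys, newRecords, changedRecords, deletedRecords and unchangedRecords are made up for illustration, and the code is written as top-level statements so it can be pasted into spark-shell.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("except-split").getOrCreate()
import spark.implicits._

// The two DataFrames from the question
val df1 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 236, 431, 169),
  ("city 2", "prod 1", "9/28/2017", 358, 975, 193),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169)
).toDF("city", "product", "date", "sale", "exp", "wastage")

val df2 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 230, 430, 160),
  ("city 1", "prod 4", "9/27/2017", 350, 90, 190),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169),
  ("city 3", "prod 4", "9/18/2017", 230, 431, 169)
).toDF("city", "product", "date", "sale", "exp", "wastage")

// Assumption: these three columns identify a record uniquely
val keys = Seq("city", "product", "date")

// Rows of df2 that do not appear verbatim in df1 (new or changed)
val addedOrChanged = df2.except(df1)

// New: the key does not exist in df1 at all
val newRecords = addedOrChanged.join(df1, keys, "left_anti")

// Changed: the key exists in df1, but some other column differs
val changedRecords = addedOrChanged.join(df1, keys, "left_semi")

// Deleted: the key exists in df1 but not in df2
val deletedRecords = df1.join(df2, keys, "left_anti")

// Unchanged: identical rows present in both DataFrames
val unchangedRecords = df1.intersect(df2)

newRecords.show(false)
changedRecords.show(false)
deletedRecords.show(false)
unchangedRecords.show(false)
```

Note that except and intersect behave like set operations and drop duplicate rows, so this approach is less suitable if the inputs can contain repeated rows.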

You can also try a join on the key columns followed by a filter to separate the changed and unchanged data:

df1.join(df2, Seq("city","product", "date"), "left").show(false)
df1.join(df2, Seq("city","product", "date"), "right").show(false)
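The joins above leave the sale, exp and wastage columns of both sides with the same names, which makes the filter step awkward. A hedged sketch of one way to finish the idea: rename the non-key columns, do a single full outer join, and label each row with the DataFrame API (no Spark SQL strings). The left_/right_ prefixes, the in_left/in_right marker columns and the status column are assumptions introduced for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder().master("local[*]").appName("join-classify").getOrCreate()
import spark.implicits._

// The two DataFrames from the question
val df1 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 236, 431, 169),
  ("city 2", "prod 1", "9/28/2017", 358, 975, 193),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169)
).toDF("city", "product", "date", "sale", "exp", "wastage")

val df2 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 230, 430, 160),
  ("city 1", "prod 4", "9/27/2017", 350, 90, 190),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169),
  ("city 3", "prod 4", "9/18/2017", 230, 431, 169)
).toDF("city", "product", "date", "sale", "exp", "wastage")

val keys = Seq("city", "product", "date")
val values = Seq("sale", "exp", "wastage")

// Prefix the non-key columns and add a marker so each side stays identifiable after the join
val left = values
  .foldLeft(df1)((df, c) => df.withColumnRenamed(c, s"left_$c"))
  .withColumn("in_left", lit(true))
val right = values
  .foldLeft(df2)((df, c) => df.withColumnRenamed(c, s"right_$c"))
  .withColumn("in_right", lit(true))

// Null-safe equality across all non-key columns
val allEqual = values.map(c => col(s"left_$c") <=> col(s"right_$c")).reduce(_ && _)

// One full outer join on the key, then one label per row
val classified = left.join(right, keys, "full_outer")
  .withColumn("status",
    when(col("in_right").isNull, "deleted")
      .when(col("in_left").isNull, "new")
      .when(allEqual, "unchanged")
      .otherwise("changed"))
  .drop("in_left", "in_right")

classified.show(false)

// Each category is then a plain filter on the label
val changed = classified.filter(col("status") === "changed")
```

The marker columns are used instead of testing a value column for null, so a record whose sale, exp or wastage is legitimately null is not mistaken for a missing row.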

Hope this helps!

answered Sep 23 '22 20:09

koiralo

A scalable and easy way is to diff the two DataFrames with spark-extension:

import uk.co.gresearch.spark.diff._

df1.diff(df2, "city", "product", "date").show

+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
|diff|  city|product|      date|left_sale|right_sale|left_exp|right_exp|left_wastage|right_wastage|
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
|   N|city 1|prod 2 |2017-08-25|       50|        50|     687|      687|         201|          201|
|   C|city 1|prod 3 |2017-09-09|      236|       230|     431|      430|         169|          160|
|   I|city 3|prod 4 |2017-09-18|     null|       230|    null|      431|        null|          169|
|   N|city 3|prod 3 |2017-09-08|      236|       236|     431|      431|         169|          169|
|   D|city 2|prod 1 |2017-09-28|      358|      null|     975|     null|         193|         null|
|   I|city 1|prod 4 |2017-09-27|     null|       350|    null|       90|        null|          190|
|   N|city 1|prod 1 |2017-09-29|      358|       358|     975|      975|         193|          193|
|   N|city 2|prod 2 |2017-08-24|       50|        50|     687|      687|         201|          201|
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+

It identifies Inserted, Changed, Deleted and uN-changed rows.
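To get the four separate DataFrames the question asks for, the diff result can be filtered on the diff column shown in the output above. A minimal sketch, assuming the spark-extension package is on the classpath and the diff column keeps the name and the I/C/D/N values shown above; the variable names are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import uk.co.gresearch.spark.diff._

val spark = SparkSession.builder().master("local[*]").appName("diff-split").getOrCreate()
import spark.implicits._

// The two DataFrames from the question
val df1 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 236, 431, 169),
  ("city 2", "prod 1", "9/28/2017", 358, 975, 193),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169)
).toDF("city", "product", "date", "sale", "exp", "wastage")

val df2 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 230, 430, 160),
  ("city 1", "prod 4", "9/27/2017", 350, 90, 190),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169),
  ("city 3", "prod 4", "9/18/2017", 230, 431, 169)
).toDF("city", "product", "date", "sale", "exp", "wastage")

// Same call as above; cached because it is filtered four times
val diffed = df1.diff(df2, "city", "product", "date").cache()

val inserted  = diffed.filter(col("diff") === "I") // new records
val changed   = diffed.filter(col("diff") === "C") // records with changes
val deleted   = diffed.filter(col("diff") === "D") // deleted records
val unchanged = diffed.filter(col("diff") === "N") // records with no changes

inserted.show(false)
```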

answered Sep 22 '22 20:09

EnricoM
