We have two DataFrames here.
The expected DataFrame:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+
and the actual DataFrame:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+
The difference between the two DataFrames is:
+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+
We are using the except function, df1.except(df2); however, the problem with this is that it returns the entire rows that are different. What we want is to see which columns differ within that row (in this case, "romin" and "romino" from "emp_name" are different). We have been having tremendous difficulty with this, and any help would be great.
An important difference for Spark is the return value. For Column: == returns a Boolean, while === returns a Column (which contains the result of the element-wise comparison of the two columns).
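For example (a minimal sketch, assuming df1 is one of the DataFrames above):

// === builds a Column expression that is evaluated per row;
// == is ordinary JVM equality and yields a single Boolean.
val perRowMatch = df1("emp_name") === "romin"   // a Column of booleans, one per row
df1.filter(perRowMatch).show()                  // usable inside filter/where/join conditions
val plainEquality: Boolean = df1("emp_name") == "romin"   // false: just object comparison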
Pretty simple. Use except() to subtract one DataFrame from another and find the difference between the two.
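For instance, assuming df1 and df2 are the two DataFrames from the question, both directions can be checked (a small sketch):

// Rows of df1 that do not appear in df2, and the reverse
val inDf1NotDf2 = df1.except(df2)
val inDf2NotDf1 = df2.except(df1)
inDf1NotDf2.show()
inDf2NotDf1.show()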
The compare method in pandas shows the differences between two DataFrames. It compares them row-wise and column-wise and presents the differences side by side. The compare method can only compare DataFrames of the same shape, with exact dimensions and identical row and column labels.
In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame. Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.
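As a quick illustration of the alias (a sketch in Scala, reusing df1 from the question):

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// DataFrame and Dataset[Row] are interchangeable in Scala, so no conversion is needed
def describe(ds: Dataset[Row]): Unit = ds.printSchema()
val asDataset: Dataset[Row] = df1   // compiles because DataFrame is just Dataset[Row]
describe(asDataset)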
From the scenario described in the question above, it looks like the difference has to be found between columns, not rows.
So, in order to do that, we need to apply a selective difference here, which will give us the columns that have different values, along with those values.
To apply the selective difference, we can write code along these lines.
First, find the columns in the expected and actual DataFrames:
val columns = df1.schema.fields.map(_.name)
Then find the difference column-wise:
val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))
Finally, find out which columns contain different values:
selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})
And we will get only the columns that contain different values, like this:
+--------+
|emp_name|
+--------+
|  romino|
+--------+
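Putting the steps together, the whole check looks roughly like this (a sketch; selectiveDifference is a hypothetical helper name, and df1/df2 are the DataFrames from the question):

import org.apache.spark.sql.DataFrame

// For each column, keep the values present in the first DataFrame but not in the second,
// and show only the columns where at least one value differs.
def selectiveDifference(first: DataFrame, second: DataFrame): Unit = {
  val columns = first.schema.fields.map(_.name)
  val diffs = columns.map(c => first.select(c).except(second.select(c)))
  diffs.foreach(diff => if (diff.count > 0) diff.show())
}

selectiveDifference(df1, df2)   // here, prints only the differing emp_name values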
I hope this helps!
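The same column-wise idea can also be written in PySpark, using subtract() in place of except():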
list_col = []
cols = df1.columns

# Prepare list of dataframes/per column
for col in cols:
    list_col.append(df1.select(col).subtract(df2.select(col)))

# Render/persist
for l in list_col:
    if l.count() > 0:
        l.show()