Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to compare two dataframe and print columns that are different in scala

Tags:

We have two data frames here:

the expected dataframe:

+------+---------+--------+----------+-------+--------+ |emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site| +------+---------+--------+----------+-------+--------+ |     3|  Chennai|  rahman|9848022330|  45000|SanRamon| |     1|Hyderabad|     ram|9848022338|  50000|      SF| |     2|Hyderabad|   robin|9848022339|  40000|      LA| |     4|  sanjose|   romin|9848022331|  45123|SanRamon| +------+---------+--------+----------+-------+--------+ 

and the actual data frame:

+------+---------+--------+----------+-------+--------+ |emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site| +------+---------+--------+----------+-------+--------+ |     3|  Chennai|  rahman|9848022330|  45000|SanRamon| |     1|Hyderabad|     ram|9848022338|  50000|      SF| |     2|Hyderabad|   robin|9848022339|  40000|      LA| |     4|  sanjose|  romino|9848022331|  45123|SanRamon| +------+---------+--------+----------+-------+--------+ 

the difference between the two dataframes now is:

+------+--------+--------+----------+-------+--------+ |emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site| +------+--------+--------+----------+-------+--------+ |     4| sanjose|  romino|9848022331|  45123|SanRamon| +------+--------+--------+----------+-------+--------+ 

We are using the except function df1.except(df2), however the problem with this is, it returns the entire rows that are different. What we want is to see which columns are different within that row (in this case, "romin" and "romino" from "emp_name" are different). We have been having tremendous difficulty with it and any help would be great.

like image 509
rominoushana Avatar asked Jun 02 '17 22:06

rominoushana


People also ask

What is the difference between == and === in Scala?

An important difference for Spark is the return value. For Column: == returns a boolean. === returns a column (which contains the result of the comparisons of the elements of two columns)

How do you find the difference between two data frames in spark?

Pretty simple. Use the except() to subtract or find the difference between two dataframes.

How do I compare two DataFrames in pandas?

The compare method in pandas shows the differences between two DataFrames. It compares two data frames, row-wise and column-wise, and presents the differences side by side. The compare method can only compare DataFrames of the same shape, with exact dimensions and identical row and column labels.

What is DataFrame in Scala?

In Scala and Java, a DataFrame is represented by a Dataset of Row s. In the Scala API, DataFrame is simply a type alias of Dataset[Row] . While, in Java API, users need to use Dataset<Row> to represent a DataFrame . Throughout this document, we will often refer to Scala/Java Datasets of Row s as DataFrames.


2 Answers

From the scenario that is described in the above question, it looks like that difference has to found between columns and not rows.

So, in order to do that we need to apply selective difference here, which will provide us the columns that have different values, along with the values.

Now, to apply selective difference we have to write code something like this:

  1. First we need to find the columns in expected and actual dataframes.

    val columns = df1.schema.fields.map(_.name)

  2. Then we have to find difference columnwise.

    val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))

  3. At last we need to find out which columns contains different values.

    selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})

And, we will get only the columns which contains different values. Like this:

+--------+ |emp_name| +--------+ |  romino| +--------+ 

I hope this helps!

like image 111
himanshuIIITian Avatar answered Sep 23 '22 05:09

himanshuIIITian


 list_col=[] cols=df1.columns  # Prepare list of dataframes/per column for col in cols:   list_col.append(df1.select(col).subtract(df2.select(col)))  # Render/persist for  l in list_col :   if l.count() > 0 :      l.show() 
like image 42
vivek mishra Avatar answered Sep 22 '22 05:09

vivek mishra