
Comparing two tables for equality in HIVE

I have two tables, table1 and table2. Each with the same columns:

key, c1, c2, c3

I want to check whether these tables are equal to each other (i.e., they have the same rows). So far I have these two queries (<> means "not equal" in HIVE):

select count(*) from table1 t1 
left outer join table2 t2
on t1.key=t2.key
where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3

And

select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t2.key is null

My idea is that if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this, please let me know.

Danzo asked Aug 04 '15 11:08




2 Answers

Well, the best way is to calculate a hash sum for each table and compare the two sums. No matter how many columns there are or what their data types are, as long as the two tables have the same schema, you can use the following queries to do the comparison:

select sum(hash(*)) from t1;
select sum(hash(*)) from t2;

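Then you just need to compare the two return values. Note that equal sums make equality very likely but not certain, since different rows can produce the same hash value.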
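The two sums can also be compared inside a single query. The following is a minimal sketch, assuming the table and column names from the question (table1, table2, key, c1, c2, c3) and that cross joins are permitted by the cluster's settings. It lists the columns explicitly in hash() rather than relying on hash(*), and it also compares row counts as a cheap extra check. The aliases cnt, hash_sum, and probably_equal are made up for illustration.

```sql
-- Sketch: compare row counts and per-table hash sums in one query.
-- Each subquery returns exactly one row, so the cross join yields one row.
select a.cnt = b.cnt and a.hash_sum = b.hash_sum as probably_equal
from (select count(*) as cnt,
             sum(hash(key, c1, c2, c3)) as hash_sum
      from table1) a
cross join
     (select count(*) as cnt,
             sum(hash(key, c1, c2, c3)) as hash_sum
      from table2) b;
```

If either table is empty, its sum is NULL and the comparison returns NULL rather than true or false, so that case needs separate handling.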

Youjun Yuan answered Sep 17 '22 17:09



If the tables have exactly the same structure and neither table has duplicate rows within it, then you can list the rows that do not appear in both tables:

select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
      (select t2.*, 2 as which from table2 t2)
     ) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;

There are various ways that you can relax the conditions in the first paragraph, if necessary.
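One such relaxation, sketched here as an assumption rather than as the only option, is to allow duplicate rows within each table by counting how often every distinct row occurs on each side and reporting the rows whose counts differ. The aliases cnt1 and cnt2 are made up for illustration.

```sql
-- Sketch: per-row occurrence counts for each table; any row returned
-- occurs a different number of times in table1 than in table2.
select t.key, t.c1, t.c2, t.c3,
       sum(case when t.which = 1 then 1 else 0 end) as cnt1,
       sum(case when t.which = 2 then 1 else 0 end) as cnt2
from (select t1.key, t1.c1, t1.c2, t1.c3, 1 as which from table1 t1
      union all
      select t2.key, t2.c1, t2.c2, t2.c3, 2 as which from table2 t2
     ) t
group by t.key, t.c1, t.c2, t.c3
having sum(case when t.which = 1 then 1 else 0 end)
    <> sum(case when t.which = 2 then 1 else 0 end);
```

An empty result means both tables contain the same rows with the same multiplicities.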

Note that this version also works when the columns have NULL values. These might be causing the problem with your data.
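To spell out the NULL point for the two queries in the question: when either side is NULL, t1.c1 <> t2.c1 evaluates to NULL rather than true, so the WHERE clause of the first query silently drops that row. In the second query, t1.c1 = t2.c1 in the ON clause also fails for NULLs, so the row finds no match, t2.key comes back NULL, and the row is counted. That would explain a zero count from the first query and a non-zero count from the second. The sketch below is one possible fix for the first query, assuming the Hive version in use supports the null-safe equality operator <=> and that key is unique and non-NULL in both tables.

```sql
-- Sketch: null-safe variant of the first query.
-- a <=> b is true when both sides are NULL and false when only one is,
-- so not (a <=> b) flags every real difference, including NULL vs. value.
select count(*)
from table1 t1
left outer join table2 t2
  on t1.key = t2.key
where t2.key is null
   or not (t1.c1 <=> t2.c1)
   or not (t1.c2 <=> t2.c2)
   or not (t1.c3 <=> t2.c3);
```

This only checks table1 against table2. Rows that exist only in table2 would need the same query with the two tables swapped.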

Gordon Linoff answered Sep 19 '22 17:09
