Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive command to execute NOT IN clause

Tags:

hadoop

hive

I have two tables,tab1 & tab2.

tab1(T1)  tab2(T2)
a1         b1
b1         c1
c1         f1
d1         g1

I am looking for the values from table T1 that are not present in T2. In this case, the output should be a1 d1

I have tried with the following query but couldn't get the right solution.

select distinct tab1.T1 from tab1 left semi join tab2 on (tab1.T1!=tab2.T2);
like image 515
Blue Diamond Avatar asked Jan 02 '14 09:01

Blue Diamond


People also ask

Can we use not in in Hive?

Though the IN operator in hive works just fine in this case as well but NOT IN doesn't. If I replace NOT IN with IN operator, that works. Inact using NOT IN with a list of strings specified works too but it somehow does not work with select statement.

How do I write not a query in Hive?

When using NOT IN you should add is not null to the inner query, unless you are 100% sure that the relevant column does not contain null values. One null value is enough to cause your query to return no results.

How do I exclude columns in Hive?

The easiest way to select specific columns in the Hive query is by specifying the column name in the select statement. SELECT col1, col3, col4 .... FROM Table1; But imagine your table contains many columns (i.e : more than 100 columns) and you need to only exclude a few columns in the select statement.

Does Hive support in clause?

Hive supports subqueries only in the FROM clause (through Hive 0.12). The subquery has to be given a name because every table in a FROM clause must have a name. Columns in the subquery select list must have unique names.


1 Answers

SELECT t1.str
FROM tab1 t1 
LEFT OUTER JOIN tab2 t2 ON t1.str = t2.str
WHERE t2.str IS NULL;

Result:

OK
a1
d1

"Why is the t2.str is null condition there": Left outer joins ensure that all values from the first table are included in the results. So then what happens when there are no values in the second table: in that case all of the columns from the second table are reported as null.

So in the case above we are searching precisely for the condition that the second table entries are missing - and thus we:

  • Choose one of the never-empty (aka not null) columns from table two.
  • So: is number an always-present column? If not then please choose another one
  • Specify the condition "table1-alias"."table1-never-null-column" = null. That means that the record is actually not present in the join condition - and thus we found the records existing only in table 1.
like image 128
WestCoastProjects Avatar answered Oct 02 '22 05:10

WestCoastProjects