Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Subquery in `where` with comparison operator

Let's say I have a large table partitioned by dt field. I want to query this table for data after specific date. E.g.

select * from mytab where dt >= 20140701;

The tricky part is that date is not a constant, but comes from a subquery. So basically I want something like this:

select * from mytab where dt >= (select min(dt) from activedates);

Hive can't do it, however, giving me ParseException on subquery (from docs I'm guessing it's just not supported yet).

So how do I restrict my query based on dynamic subquery?

Note, that performance is key point here. So the faster, the better, even if it looks uglier.

Also note, that we haven't switched to Hive 0.13 yet, so solutions without in query are preferred.

like image 546
ffriend Avatar asked Jul 08 '14 11:07

ffriend


People also ask

Can subquery be used with comparison operator?

Subqueries can be used with SELECT, UPDATE, INSERT, DELETE statements along with expression operator. It could be equality operator or comparison operator such as =, >, =, <= and Like operator. A subquery is a query within another query.

Can you use a subquery in a WHERE clause?

A subquery can be nested inside the WHERE or HAVING clause of an outer SELECT , INSERT , UPDATE , or DELETE statement, or inside another subquery.

Can multiple-row subquery be used with comparison operator?

Multiple-row subqueries are used most commonly in WHERE and HAVING clauses. Since it returns multiple rows, it must be handled by set comparison operators (IN, ALL, ANY).


1 Answers

Hive decides on the partition pruning when building the execution plan and thus has to have the value of the max(dt) prior to execution.

Currently the only way to accomplish something like this is breaking the query into two parts, when the first will be select min(dt) from activedates, its results will be put into a variable.
2nd query will be : select * from mytab where dt >=${hiveconf:var}.

Now this is a bit tricky.
You could either execute the 1st query into OS variable like so :

a=`hive -S -e "select min(dt) from activedates"`

And then run the 2nnd query like so :

hive -hiveconf var=$a -e "select * from mytab where dt >=${hiveconf:var}"

or event just :

hive -e "select * from mytab where dt >=$a"

Or, if you are using some other scripting language you can replace the variable in the code.

like image 154
dimamah Avatar answered Sep 30 '22 15:09

dimamah