
Is the GROUP BY clause applied after the WHERE clause in Hive?

Tags: hive, hiveql

Suppose I have the following SQL:

select user_group, count(*)
from table
where user_group is not null
group by user_group

Suppose further that 99% of the data has null user_group.

Will this discard the rows with null before the GROUP BY, or will one poor reducer end up with 99% of the rows that are later discarded?

I hope it is the former. That would make more sense.

Bonus points if you say what will happen by Hive version. We are using 0.11 and migrating to 0.13.

Bonus points if you can point to any documentation that confirms.

dfrankow asked May 27 '15 05:05


1 Answer

Sequence

1. FROM and JOINs determine which rows are read
2. WHERE filters those rows
3. GROUP BY combines the remaining rows into groups
4. HAVING filters the groups
5. SELECT picks the columns and expressions to return
6. ORDER BY sorts the remaining rows or groups

The first step is always the FROM clause. In your case this is straightforward, because there is only one table and no joins to worry about. In a query with joins, they are evaluated in this first step: the ON conditions decide which rows from each table are combined. The result of the FROM clause is an intermediate result, which you can think of as a temporary table made up of the combined rows that satisfy all the join conditions. (In your case the temporary table isn't actually built, because the optimizer knows it can read your table directly without joining to any others.)

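The next step is the WHERE clause. In a query with a WHERE clause, each row in the intermediate result is evaluated against the WHERE conditions and either discarded or retained. So the rows with a null user_group are discarded before they reach the GROUP BY clause. In Hive this filter is normally evaluated in the map phase, so the null rows should never be shuffled to a reducer; running EXPLAIN on the query shows where the Filter Operator sits in the plan.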

Next comes the GROUP BY. If there's a GROUP BY clause, the intermediate result is now partitioned into groups, one group for every combination of values in the columns in the GROUP BY clause.
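The WHERE and GROUP BY steps described above can be mimicked in plain Python. This is only a sketch of the logical order, not of how Hive executes the query; the sample data (990 null rows, 10 non-null rows) and the variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: 99% of the rows have a null user_group.
rows = (
    [{"user_group": None}] * 990
    + [{"user_group": "a"}] * 6
    + [{"user_group": "b"}] * 4
)

# WHERE user_group IS NOT NULL: applied row by row, before any grouping.
filtered = [r for r in rows if r["user_group"] is not None]

# GROUP BY user_group with count(*): only the surviving rows are grouped.
counts = Counter(r["user_group"] for r in filtered)

print(len(rows), len(filtered))  # 1000 10
print(dict(counts))              # {'a': 6, 'b': 4}
```

Only 10 of the 1000 rows ever reach the grouping step, which is the behaviour the question was hoping for.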

Now comes the HAVING clause. The HAVING clause operates once on each group, and all rows from groups which do not satisfy the HAVING clause are eliminated.

Next comes the SELECT. From the rows of the new intermediate result produced by the GROUP BY and HAVING clauses, the SELECT now assembles the columns it needs.

The last step is the ORDER BY clause, which sorts the remaining rows or groups.
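To see all six steps in one query, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for Hive. SQLite follows the same logical clause order, but it says nothing about Hive's physical plan. The table name events, the HAVING threshold and the sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory database standing in for the real table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_group TEXT)")
conn.executemany(
    "INSERT INTO events (user_group) VALUES (?)",
    [(None,)] * 990 + [("a",)] * 6 + [("b",)] * 3 + [("c",)] * 1,
)

result = conn.execute(
    """
    SELECT user_group, COUNT(*) AS n   -- 5. pick the output columns
    FROM events                        -- 1. read the rows
    WHERE user_group IS NOT NULL       -- 2. drop the null rows
    GROUP BY user_group                -- 3. group what is left
    HAVING COUNT(*) >= 3               -- 4. drop the small groups
    ORDER BY n DESC                    -- 6. sort the final rows
    """
).fetchall()
conn.close()

print(result)  # [('a', 6), ('b', 3)]
```

The null rows never form a group, and group c is removed by HAVING only after its count has been computed.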

Kishore answered Sep 23 '22 08:09