Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Windowing function in Hive

I am exploring windowing functions in Hive and I am able to understand the functionalities of all the UDFs. Although, I am not able to understand the partition by and order by that we use with the other functions. Following is the structure that is very similar to the query which I am planning to build.

SELECT a, RANK() OVER(partition by b order by c) as d from xyz; 

Just trying to understand the background process involved for both keywords.

Appreciate the help :)

like image 352
satish silveri Avatar asked Apr 29 '19 18:04

satish silveri


People also ask

What does a windowing function do?

Window functions perform calculations on a set of rows that are related together. But, unlike the aggregate functions, windowing functions do not collapse the result of the rows into a single value. Instead, all the rows maintain their original identity and the calculated result is returned for every row.

Does Hive support window functions?

This section introduces the Hive QL enhancements for windowing and analytics functions. The JIRA HIVE-896 has more information, including links to documentation in the initial comments. All of the windowing and analytics functions operate as per the SQL standard. The number of rows to lead can optionally be specified.

What is spark windowing?

Window functions are useful for processing tasks such as calculating a moving average, computing a cumulative statistic, or accessing the value of rows given the relative position of the current row.


1 Answers

RANK() analytic function assigns a rank to each row in each partition in the dataset.

PARTITION BY clause determines how the rows to be distributed (between reducers if it is hive).

ORDER BY determines how the rows are being sorted in the partition.

First phase is distribute by, all rows in a dataset are distributed into partitions. In map-reduce each mapper groups rows according to the partition by and produces files for each partition. Mapper does initial sorting of partition parts according to the order by.

Second phase, all rows are sorted inside each partition. In map-reduce, each reducer gets partitions files (parts of partitions) produced by mappers and sorts rows in the whole partition (sort of partial results) according to the order by.

Third, rank function assigns rank to each row in a partition. Rank function is being initialized for each partition.

For the first row in the partition rank starts with 1. For each next row Rank=previous row rank+1. Rows with equal values (specified in the order by) given the same rank, if the two rows share the same rank, next row rank is not consecutive.

Different partitions can be processed in parallel on different reducers. Small partitions can be processed on the same reducer. Rank function re-initializes when it crossing the partition boundary and starts with rank=1 for each partition.

Example (rows are already partitioned and sorted inside partitions):

SELECT a, RANK() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(rank)
----------------
1  1  1  1 --starts with 1
2  1  1  1 --the same c value, the same rank=1
3  1  2  3 --rank 2 is skipped because second row shares the same rank as first 

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3

If you need consecutive ranks, use dense_rank function. dense_rank will produce rank=2 for the third row in the above dataset.

row_number function will assign a position number to each row in the partition starting with 1. Rows with equal values will receive different consecutive numbers.

SELECT a, ROW_NUMBER() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(row_number)
----------------
1  1  1  1 --starts with 1
2  1  1  2 --the same c value, row number=2
3  1  2  3 --row position=3

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3

Important note: For rows with the same values row_number or other such analytic function may have non-deterministic behavior and produce different numbers from run to run. First row in the above dataset may receive number 2 and second row may receive number 1 and vice-versa, because their order is not determined unless you will add one more column a to the order by clause. In this case all rows will always have the same row_number from run to run, their order values are different.

like image 123
leftjoin Avatar answered Sep 17 '22 17:09

leftjoin