Windowing function in Hive

Tags:

I am exploring windowing functions in Hive and I am able to understand the functionalities of all the UDFs. Although, I am not able to understand the partition by and order by that we use with the other functions. Following is the structure that is very similar to the query which I am planning to build.

Click to copy

SELECT a, RANK() OVER(partition by b order by c) as d from xyz;

Just trying to understand the background process involved for both keywords.

Appreciate the help :)

352

asked Apr 29 '19 18:04

satish silveri

1 Answers

RANK() analytic function assigns a rank to each row in each partition in the dataset.

PARTITION BY clause determines how the rows to be distributed (between reducers if it is hive).

ORDER BY determines how the rows are being sorted in the partition.

First phase is distribute by, all rows in a dataset are distributed into partitions. In map-reduce each mapper groups rows according to the partition by and produces files for each partition. Mapper does initial sorting of partition parts according to the order by.

Second phase, all rows are sorted inside each partition. In map-reduce, each reducer gets partitions files (parts of partitions) produced by mappers and sorts rows in the whole partition (sort of partial results) according to the order by.

Third, rank function assigns rank to each row in a partition. Rank function is being initialized for each partition.

For the first row in the partition rank starts with 1. For each next row Rank=previous row rank+1. Rows with equal values (specified in the order by) given the same rank, if the two rows share the same rank, next row rank is not consecutive.

Different partitions can be processed in parallel on different reducers. Small partitions can be processed on the same reducer. Rank function re-initializes when it crossing the partition boundary and starts with rank=1 for each partition.

Example (rows are already partitioned and sorted inside partitions):

Click to copy

SELECT a, RANK() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(rank)
----------------
1  1  1  1 --starts with 1
2  1  1  1 --the same c value, the same rank=1
3  1  2  3 --rank 2 is skipped because second row shares the same rank as first 

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3

If you need consecutive ranks, use dense_rank function. dense_rank will produce rank=2 for the third row in the above dataset.

row_number function will assign a position number to each row in the partition starting with 1. Rows with equal values will receive different consecutive numbers.

Click to copy

SELECT a, ROW_NUMBER() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(row_number)
----------------
1  1  1  1 --starts with 1
2  1  1  2 --the same c value, row number=2
3  1  2  3 --row position=3

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3

Important note: For rows with the same values row_number or other such analytic function may have non-deterministic behavior and produce different numbers from run to run. First row in the above dataset may receive number 2 and second row may receive number 1 and vice-versa, because their order is not determined unless you will add one more column a to the order by clause. In this case all rows will always have the same row_number from run to run, their order values are different.

123

answered Sep 17 '22 17:09

leftjoin

Related questions
                            
                                Error when connecting to redshift: "server certificate does not match host name"
                            
                                Postgres Creating JSON Object from Aggregated Rows
                            
                                Using the same table alias twice in a query
                            
                                Foreign key constraint - how to delete referenced record?
                            
                                How to get all tables and column names in SQL?
                            
                                Azure SQL large deletes
                            
                                MYSQL - Order column by ASCII Value
                            
                                SQL Server - Cumulative Sum that resets when 0 is encountered
                            
                                Way to find data of a sql table with same status for consecutive 3 days
                            
                                Show last access time of a table in Oracle DB
                            
                                Why CREATE TABLE AS SELECT is more faster than INSERT with SELECT
                            
                                PostgreSQL - create an auto-increment column for non-primary key
                            
                                SQL Hibernate is not retuning the same results as a direct SQL query
                            
                                What is difference between clause, command, statement and query in SQL?
                            
                                How to group by hour in Google Bigquery
                            
                                Obtaining string after Last Slash in BigQuery Standard SQL
                            
                                Why SQL primary key index begin at 1 and not at 0?
                            
                                Format int as date in presto SQL
                            
                                Group Customers by Status in T-SQL
                            
                                PHP / SQL - Improving search feature / fuzzy search

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Windowing function in Hive

Tags:

sql

hive

mapreduce

hadoop-partitioning

ranking-functions

satish silveri

People also ask

1 Answers

leftjoin

Recent Activity

Donate For Us