Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get sequential number of a row (rank) within a partition without using ROW_NUMBER() OVER function

Tags:

sql

hive

impala

I need to rank rows by partition (or group), i.e. if my source table is:

NAME PRICE
---- -----
AAA  1.59
AAA  2.00
AAA  0.75
BBB  3.48
BBB  2.19
BBB  0.99
BBB  2.50

I would like to get target table:

RANK NAME PRICE
---- ---- -----
1    AAA  0.75
2    AAA  1.59
3    AAA  2.00
1    BBB  0.99
2    BBB  2.19
3    BBB  2.50
4    BBB  3.48

Normally I would use ROW_NUMBER() OVER function, so in Apache Hive it would be:

select
  row_number() over (partition by NAME order by PRICE) as RANK,
  NAME,
  PRICE
from
  MY_TABLE
;

Unfortunately, Cloudera Impala does not support (at the moment) ROW_NUMBER() OVER function, so I'm looking for a workaround. Preferably not to use UDAF, as it will be politically difficult to convince to deploy it to the server.

like image 443
Andrey Dmitriev Avatar asked May 02 '14 10:05

Andrey Dmitriev


1 Answers

If you can't do it with a correlated subquery, you can still do this with a join:

select t1.name, t1.price,
       coalesce(count(t2.name) + 1, 1)
from my_table t1 join
     my_table t2
     on t2.name = t1.name and
        t2.price < t1.price
order by t1.name, t1.price;

Note that this doesn't exactly do row_number() unless all the prices are distinct for a given name. This formulation is actually equivalent to rank().

For row_number(), you need a unique row identifier.

By the way, the following is equivalent to dense_rank():

select t1.name, t1.price,
       coalesce(count(distinct t2.name) + 1, 1)
from my_table t1 join
     my_table t2
     on t2.name = t1.name and
        t2.price < t1.price
order by t1.name, t1.price;
like image 109
Gordon Linoff Avatar answered Sep 29 '22 12:09

Gordon Linoff