Currently, to percentile rank a column in hive, I am using something like the following. I am trying to rank items in a column by what percentile they fall under, assigning a value form 0 to 1 to each item. The code below assigns a value from 0 to 9, essentially saying that an item with a char_percentile_rank
of 0 is in the bottom 10% of items, and a value of 9 is in the top 10% of items. Is there a better way of doing this?
select item
, characteristic
, case when characteristic <= char_perc[0] then 0
when characteristic <= char_perc[1] then 1
when characteristic <= char_perc[2] then 2
when characteristic <= char_perc[3] then 3
when characteristic <= char_perc[4] then 4
when characteristic <= char_perc[5] then 5
when characteristic <= char_perc[6] then 6
when characteristic <= char_perc[7] then 7
when characteristic <= char_perc[8] then 8
else 9
end as char_percentile_rank
from (
select split(item_id,'-')[0] as item
, split(item_id,'-')[1] as characteristic
, char_perc
from (
select collect_set(concat_ws('-',item,characteristic)) as item_set
, PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as char_perc
from(
select item
, sum(characteristic) as characteristic
from table
group by item
) t1
) t2
lateral view explode(item_set) explodetable as item_id
) t3
Note: I had to do the collect_set
in order to avoid a self join, as the percentile function implicitly performs a group by
.
I've gathered that the percentile function is horribly slow (at least in this usage). Perhaps it would be better to manually calculate percentile?
Try removing one of your derived tables
select item
, characteristic
, case when characteristic <= char_perc[0] then 0
when characteristic <= char_perc[1] then 1
when characteristic <= char_perc[2] then 2
when characteristic <= char_perc[3] then 3
when characteristic <= char_perc[4] then 4
when characteristic <= char_perc[5] then 5
when characteristic <= char_perc[6] then 6
when characteristic <= char_perc[7] then 7
when characteristic <= char_perc[8] then 8
else 9
end as char_percentile_rank
from (
select item, characteristic,
, PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) over () as char_perc
from (
select item
, sum(characteristic) as characteristic
from table
group by item
) t1
) t2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With