Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive - how to get the quantile on values per group

Tags:

hadoop

hive

How can I calculate the quantile (ntile, or percentile) for a value, for each group of rows of the same item?

I would like to know for item '101', considering only the rows where 'p' is 1, which is the value needed to be in the top 25% for example.

create table t1
(item INT,
p INT,
value FLOAT
);

insert into t1 values ('101', '1', '.5');
insert into t1 values ('101', '2', '.4');
insert into t1 values ('101', '1', '.6');
insert into t1 values ('101', '2', '.2');
insert into t1 values ('101', '1', '.7');
insert into t1 values ('101', '2', '.3');
insert into t1 values ('102', '1', '1.5');
insert into t1 values ('102', '2', '1.4');
insert into t1 values ('102', '1', '1.6');
insert into t1 values ('102', '2', '1.2');
insert into t1 values ('102', '1', '1.7');
insert into t1 values ('102', '2', '1.3');

I have tried the following but get an error.

SELECT 
    item,
    p,
    value,
NTILE(4) OVER (ORDER BY value DESC) AS quartile
FROM t1
group by item
where p=1

Error message:

Error while compiling statement: FAILED ParseException line 8:0 missing EOF at 'where' near item

I can do it in R, with a command like:

d[p==1, quantile(value, .75, na.rm=TRUE), by=item]

but I need this in Hadoop for performance reasons.

like image 644
Timothée HENRY Avatar asked Oct 01 '15 09:10

Timothée HENRY


People also ask

How do you find quantile value?

For a sample, you can find any quantile by sorting the sample. The middle value of the sorted sample (middle quantile, 50th percentile) is known as the median. The limits are the minimum and maximum values. Any other locations between these points can be described in terms of centiles/percentiles.

How do you find the 75th percentile in hive?

Below query is used to find the 25th,50th,75th percentile values for each item. select item,p,percentile_approx(value,array(0.25,0.50. 0.75)) from t1 where p=1 group by item,p; Below query is used to find the given percentile values for each item.

What is the 90% quantile?

The 90th percentile indicates the point where 90% percent of the data have values less than this number. More generally, the pth percentile is the number n for which p% of the data is less than n.


2 Answers

In Hive using Percentile function we can find the quantile values.

Below query is used to find the 25th,50th,75th percentile values for each item.

 select item,p,percentile_approx(value,array(0.25,0.50.0.75)) 
 from t1 where p=1 group by item,p;

Below query is used to find the given percentile values for each item.

select item,p,percentile_approx(value,0.5) 
from t1 where p=1 group by item,p;
like image 115
anand Avatar answered Oct 06 '22 04:10

anand


"Where" should come before "group by"

like image 38
user3834191 Avatar answered Oct 06 '22 03:10

user3834191