How can I calculate the quantile (ntile, or percentile) for a value, for each group of rows of the same item?
I would like to know for item '101', considering only the rows where 'p' is 1, which is the value needed to be in the top 25% for example.
create table t1
(item INT,
p INT,
value FLOAT
);
insert into t1 values ('101', '1', '.5');
insert into t1 values ('101', '2', '.4');
insert into t1 values ('101', '1', '.6');
insert into t1 values ('101', '2', '.2');
insert into t1 values ('101', '1', '.7');
insert into t1 values ('101', '2', '.3');
insert into t1 values ('102', '1', '1.5');
insert into t1 values ('102', '2', '1.4');
insert into t1 values ('102', '1', '1.6');
insert into t1 values ('102', '2', '1.2');
insert into t1 values ('102', '1', '1.7');
insert into t1 values ('102', '2', '1.3');
I have tried the following but get an error.
SELECT
item,
p,
value,
NTILE(4) OVER (ORDER BY value DESC) AS quartile
FROM t1
group by item
where p=1
Error message:
Error while compiling statement: FAILED ParseException line 8:0 missing EOF at 'where' near item
I can do it in R, with a command like:
d[p==1, quantile(value, .75, na.rm=TRUE), by=item]
but I need this in Hadoop for performance reasons.
For a sample, you can find any quantile by sorting the sample. The middle value of the sorted sample (middle quantile, 50th percentile) is known as the median. The limits are the minimum and maximum values. Any other locations between these points can be described in terms of centiles/percentiles.
Below query is used to find the 25th,50th,75th percentile values for each item. select item,p,percentile_approx(value,array(0.25,0.50. 0.75)) from t1 where p=1 group by item,p; Below query is used to find the given percentile values for each item.
The 90th percentile indicates the point where 90% percent of the data have values less than this number. More generally, the pth percentile is the number n for which p% of the data is less than n.
In Hive using Percentile function we can find the quantile values.
Below query is used to find the 25th,50th,75th percentile values for each item.
select item,p,percentile_approx(value,array(0.25,0.50.0.75))
from t1 where p=1 group by item,p;
Below query is used to find the given percentile values for each item.
select item,p,percentile_approx(value,0.5)
from t1 where p=1 group by item,p;
"Where" should come before "group by"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With