How do I do this in Hive?
columnA columnB columnC
100.10 50.60 30
100.10 50.60 30
100.10 50.60 20
100.10 70.80 40
Output should be:
columnA columnB No_of_distinct_colC
100.10 50.60 2
100.10 70.80 1
Query that I think is correct:
SELECT columnA,columnB,COUNT(distinct column C)
from table_name
group by columnA,columnB
Is this correct? SQL is fine too.
UPDATE: How do I find the standard deviation of columnC? Need this asap.
DISTINCT keyword is used in SELECT statement in HIVE to fetch only unique rows. The row does not mean entire row in the table but it means "row" as per column listed in the SELECT statement. If the SELECT has 3 columns listed then SELECT DISTINCT will fetch unique row for those 3 column values only.
The COUNT DISTINCT function returns the number of unique values in the column or expression, as the following example shows. SELECT COUNT (DISTINCT item_num) FROM items; If the COUNT DISTINCT function encounters NULL values, it ignores them unless every value in the specified column is NULL.
DISTINCT specifies removal of duplicate rows from the result set. Note, Hive supports SELECT DISTINCT * starting in release 1.1. 0 (HIVE-9194). ALL and DISTINCT can also be used in a UNION clause – see Union Syntax for more information.
Simple Examples. In order to count the number of rows in a table: SELECT COUNT(*) FROM table2; Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Yes, it is almost correct. But you have one simple mistake. Your column name is wrong inside COUNT.
SELECT columnA,columnB,COUNT(DISTINCT columnC) No_of_distinct_colC
from table_name
group by columnA,columnB
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With