Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

DISTRIBUTE BY clause in HIVE

I am not able to understand what this DISTRIBUTE BY clause does in Hive. I know the definition that says, if we have DISTRIBUTE BY (city), this would send each city in a different reducer but I am not getting the same. Let us consider the data as follows:

Say we have a table called data with columns username and amount:

+----------+--------+
| username | amount |
+----------+--------+
| user_1   | 25     |
+----------+--------+
| user_1   | 53     |
+----------+--------+
| user_1   | 28     |
+----------+--------+
| user_1   | 50     |
+----------+--------+
| user_2   | 20     |
+----------+--------+
| user_2   | 50     |
+----------+--------+
| user_2   | 10     |
+----------+--------+
| user_2   | 5      |
+----------+--------+

Now If I say -

SELECT username, SUM(amount) FROM data DISTRIBUTE BY (username)

Shouldn't this run 2 separate reducers? It is still running a single reducer and I don't know why. I thought this may have to do with clustering into buckets or partitioning but I tried everything, and it still runs a single reducer. Can anyone explain why?

like image 726
User9523 Avatar asked Feb 14 '17 18:02

User9523


People also ask

What is group by in Hive?

The GROUP BY clause is used to group all the records in a result set using a particular collection column. It is used to query a group of records.

What is clustered by in Hive?

“clustered by” clause is used to divide the table into buckets. Each bucket will be saved as a file under table directory. Bucketing can be done along with partitioning or without partitioning on Hive tables. Bucketed tables will create almost equally distributed data file parts.

Can we use limit with order by in Hive?

LIMIT clause is optional with the ORDER BY clause. LIMIT clause can be used to improve the performance.


1 Answers

The only thing DISTRIBUTE BY (city) says is that records with the same city will go to the same reducer. Nothing else.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy


A question by the OP:

Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?


For 2 reasons:

  1. In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions https://oren.lederman.name/?p=32)

  2. You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously.

like image 106
David דודו Markovitz Avatar answered Oct 06 '22 19:10

David דודו Markovitz