DISTRIBUTE BY clause in HIVE

Tags:

I am not able to understand what this DISTRIBUTE BY clause does in Hive. I know the definition that says, if we have DISTRIBUTE BY (city), this would send each city in a different reducer but I am not getting the same. Let us consider the data as follows:

Say we have a table called data with columns username and amount:

+----------+--------+
| username | amount |
+----------+--------+
| user_1   | 25     |
+----------+--------+
| user_1   | 53     |
+----------+--------+
| user_1   | 28     |
+----------+--------+
| user_1   | 50     |
+----------+--------+
| user_2   | 20     |
+----------+--------+
| user_2   | 50     |
+----------+--------+
| user_2   | 10     |
+----------+--------+
| user_2   | 5      |
+----------+--------+

Now If I say -

SELECT username, SUM(amount) FROM data DISTRIBUTE BY (username)

Shouldn't this run 2 separate reducers? It is still running a single reducer and I don't know why. I thought this may have to do with clustering into buckets or partitioning but I tried everything, and it still runs a single reducer. Can anyone explain why?

726

asked Feb 14 '17 18:02

User9523

1 Answers

The only thing DISTRIBUTE BY (city) says is that records with the same city will go to the same reducer. Nothing else.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

A question by the OP:

Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?

For 2 reasons:

In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions https://oren.lederman.name/?p=32)
You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously.

106

answered Oct 06 '22 19:10

David דודו Markovitz

Related questions
                            
                                Error in Hive Query while joining tables
                            
                                Hive Query Execution Error, return code 3 from MapredLocalTask
                            
                                Spark 2: how does it work when SparkSession enableHiveSupport() is invoked
                            
                                java.lang.OutOfMemoryError: Java heap space with hive
                            
                                Column to comma separated value in Hive
                            
                                Hive doesn't support in, exists. How do I write the following query?
                            
                                Hive query with multiple LIKE operators
                            
                                Generating star schema in hive
                            
                                How do I get independent service Zeppelin to see Hive?
                            
                                Is star schema still necessary for a big-data-warehouse?
                            
                                How to insert spark structured streaming DataFrame to Hive external table/location?
                            
                                Difference between 'Stored as InputFormat, OutputFormat' and 'Stored as' in Hive
                            
                                Spark SQL is not converting timezone correctly [duplicate]
                            
                                Hive LEFT SEMI JOIN for 'NOT EXISTS'
                            
                                How to convert a string to timestamp with milliseconds in Hive
                            
                                Access Hive Data Using Python
                            
                                How to cross join unnest a JSON array in Presto
                            
                                How to optimize partitioning when migrating data from JDBC source?
                            
                                What is the hive command to see the value of hive.exec.dynamic.partition

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

DISTRIBUTE BY clause in HIVE

Tags:

hadoop2

hive

hiveql

User9523

People also ask

1 Answers

David דודו Markovitz

Recent Activity

Donate For Us