
Hive clustered by on more than one column

I understand that when a Hive table is CLUSTERED BY a single column, Hive applies a hash function to that bucketing column and places each row into one of the buckets. There is one file per bucket, i.e. if there are 32 buckets then there are 32 files in HDFS.

What does it mean to have CLUSTERED BY on more than one column? For example, let's say that the table has CLUSTERED BY (continent, country) INTO 32 BUCKETS.

How would the hash function be applied if there is more than one column?

How many files would be generated? Is it still 32?

Manikandan Kannan asked Jun 16 '15 15:06




1 Answer

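  1. Yes, the number of files will still be 32. The bucket count comes from INTO 32 BUCKETS, not from the number of columns in CLUSTERED BY (if the table is also partitioned, that is 32 files per partition).
  2. The hash function treats (continent, country) as one composite key: the values of both columns are hashed together into a single hash value, and that value modulo 32 selects the bucket. Rows with the same continent and country therefore always land in the same bucket.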
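To make the composite-key idea concrete, here is a minimal Python sketch. It is only an illustration, not Hive's actual code: it assumes a Java-style combination of per-column hashes (`31 * h + column_hash`) followed by a non-negative modulo, which is how Hive's older bucketing is generally described. Newer Hive versions may use a different hash (for example Murmur3), so the exact bucket numbers here should not be expected to match a real cluster. The function names `java_string_hash` and `bucket_for` are made up for this example.

```python
def java_string_hash(s: str) -> int:
    """Stand-in per-column hash: Java-style string hash with 32-bit wraparound."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_for(values, num_buckets: int) -> int:
    """Combine the per-column hashes into one value, then pick a bucket."""
    h = 0
    for v in values:
        # Assumed combination rule: fold each column's hash into the running hash.
        h = (31 * h + java_string_hash(v)) & 0xFFFFFFFF
    # Clear the sign bit so the result is non-negative, then take the modulo.
    return (h & 0x7FFFFFFF) % num_buckets


rows = [("Asia", "India"), ("Asia", "Japan"), ("Europe", "France"), ("Asia", "India")]
# Map each distinct (continent, country) pair to its bucket number.
buckets = {row: bucket_for(row, 32) for row in rows}
```

Whatever the exact hash, the two points from the answer hold: the bucket number is always in the range 0 to 31, so there are at most 32 bucket files, and identical (continent, country) pairs always map to the same bucket.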

Hope it helps!!

Maddy RS answered Sep 22 '22 10:09