Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can we decide the total no. of buckets for a hive table

i am bit new to hadoop. As per my knowledge buckets are fixed no. of partitions in hive table and hive uses the no. of reducers same as the total no. of buckets defined while creating the table. So can anyone tell me how to calculate the total no. of buckets in a hive table. Is there any formula for calculating the total number of buckets ?

like image 256
Biswa Bandana Nayak Avatar asked Jun 09 '15 11:06

Biswa Bandana Nayak


People also ask

How is bucketing done in Hive?

Bucketing in hive is the concept of breaking data down into ranges, which are known as buckets, to give extra structure to the data so it may be used for more efficient queries. The range for a bucket is determined by the hash value of one or more columns in the dataset (or Hive metastore table).

How buckets are stored in Hive?

Apache Hive for Data Engineers (Hands On) A user can determine the range of a specific bucket by the hash value. Partitioned tables can be bucketed to separate the data further to perform queries more efficiently. Every bucket is stored as a file within the table or the partition's directories on HDFS.

How many buckets are you allowed?

By default, you can create up to 100 buckets in each of your AWS accounts. If you need additional buckets, you can increase your account bucket limit to a maximum of 1,000 buckets by submitting a service limit increase. There is no difference in performance whether you use many buckets or just a few.


2 Answers

Lets take a scenario Where table size is: 2300 MB, HDFS Block Size: 128 MB

Now, Divide 2300/128=17.96

Now, remember number of bucket will always be in the power of 2.

So we need to find n such that 2^n > 17.96

n=5

So, I am going to use number of buckets as 2^5=32

Hope, It will help some of you.

like image 100
puja Avatar answered Oct 24 '22 14:10

puja


From the documentation link

In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.

like image 26
Ramzy Avatar answered Oct 24 '22 13:10

Ramzy