Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive Buckets-understanding TABLESAMPLE(BUCKET X OUT OF Y)

Hi i am very much new to hive,i have gone through buckets concept in hadoop in action,but failed to understand the below lines.can any one help me on this?

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32);

The general syntax for TABLESAMPLE is TABLESAMPLE(BUCKET x OUT OF y)

The sample size for the query is around 1/y. In addition, y needs to be a multiple or factor of the number of buckets specified for the table at table creation time. For example, if we change y to 16, the query becomes

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 16);

Then the sample size includes approximately 1 out of every 16 users (as the bucket column is userid). The table still has 32 buckets, but Hive tries to satisfy this query by processing buckets 1 and 17 together. On the other hand, if y is specified to be 64, Hive will execute the query on half of the data in one bucket. The value of x is only used to select which bucket to use. Under truly random sampling its value shouldn’t matter.

like image 715
user1585111 Avatar asked Sep 13 '13 08:09

user1585111


People also ask

How does Hive determine bucket size?

What are the factors to be considered while deciding the number of buckets? One factor could be the block size itself as each bucket is a separate file in HDFS. The file size should be at least the same as the block size. The other factor could be the volume of data.

How do you know which column to use in bucketing?

To choose the column by which to bucket the CTAS query results, use the column that has a high number of values (high cardinality) and whose data can be split for storage into many buckets that will have roughly the same amount of data.

How do I select data from a bucket in Hive?

The syntax used to sample data from a bucket is tablesample and it is placed in the FROM clause in a query. In general, the tablesample clause allows for the querying of samples of data from a table whether bucketed or not. Different parameters are specified for each operation.

How does data distribution happens in Hive bucketing explain?

Bucketing feature of Hive can be used to distribute/organize the table/partition data into multiple files such that similar records are present in the same file. While creating a Hive table, a user needs to give the columns to be used for bucketing and the number of buckets to store the data into.


1 Answers

Which part of it don't you understand?

When you create the table and bucket it using the clustered by clause into 32 buckets (as an example), hive buckets your data into 32 buckets using deterministic hash functions. Then when you use TABLESAMPLE(BUCKET x OUT OF y), hive divides your buckets into groups of y buckets and then picks the x'th bucket of each group. For example:

  • If you use TABLESAMPLE(BUCKET 6 OUT OF 8), hive would divide your 32 buckets into groups of 8 buckets resulting in 4 groups of 8 buckets and then picks the 6th bucket of each group, hence picking the buckets 6, 14, 22, 30.

  • If you use TABLESAMPLE(BUCKET 23 OUT OF 32), hive would divide your 32 buckets into groups of 32, resulting in only 1 group of 32 buckets, and then picks the 23rd bucket as your result.

  • If you use TABLESAMPLE(BUCKET 3 OUT OF 64), hive would divide your 32 buckets into groups of 64 buckets, resulting in 1 group of 64 "half-bucket"s and then picks the half-bucket that corresponds to the 3rd full-bucket.

like image 152
bbkglb Avatar answered Sep 19 '22 21:09

bbkglb