Hive Buckets-understanding TABLESAMPLE(BUCKET X OUT OF Y)

Tags: hadoop, hive, mapreduce

Hi, I am very new to Hive. I have gone through the buckets concept in Hadoop in Action, but failed to understand the lines below. Can anyone help me with this?

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32);

The general syntax for TABLESAMPLE is TABLESAMPLE(BUCKET x OUT OF y).

The sample size for the query is around 1/y. In addition, y needs to be a multiple or factor of the number of buckets specified for the table at table creation time. For example, if we change y to 16, the query becomes

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 16);

Then the sample size includes approximately 1 out of every 16 users (as the bucket column is userid). The table still has 32 buckets, but Hive tries to satisfy this query by processing buckets 1 and 17 together. On the other hand, if y is specified to be 64, Hive will execute the query on half of the data in one bucket. The value of x is only used to select which bucket to use. Under truly random sampling its value shouldn’t matter.

asked Sep 13 '13 08:09

user1585111

1 Answer

Which part of it don't you understand?

When you create the table and bucket it into 32 buckets (as an example) using the clustered by clause, Hive assigns each row to one of the 32 buckets with a deterministic hash function on the bucketing column. Then, when you use TABLESAMPLE(BUCKET x OUT OF y), Hive divides your buckets into groups of y buckets and picks the x-th bucket of each group. For example:

If you use TABLESAMPLE(BUCKET 6 OUT OF 8), Hive divides your 32 buckets into 4 groups of 8 buckets and picks the 6th bucket of each group, that is, buckets 6, 14, 22 and 30.
If you use TABLESAMPLE(BUCKET 23 OUT OF 32), Hive divides your 32 buckets into groups of 32, which gives a single group, and picks the 23rd bucket as your result.
If you use TABLESAMPLE(BUCKET 3 OUT OF 64), Hive treats your 32 buckets as one group of 64 "half-buckets" and picks the half-bucket that corresponds to the 3rd full bucket.
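
If it helps to see the arithmetic, here is a rough Python sketch of the grouping rule described above. It is only a model of the explanation, not Hive's actual code. The function name selected_buckets is made up for illustration, and raising an error when y is neither a factor nor a multiple of the bucket count is a choice of this sketch, not a claim about what Hive does in that case.

```python
def selected_buckets(x, y, num_buckets=32):
    """Model of TABLESAMPLE(BUCKET x OUT OF y) on a bucketed table.

    Returns (buckets, fraction): the 1-based bucket numbers that are read
    and the fraction of each of those buckets that ends up in the sample.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")

    if y <= num_buckets:
        if num_buckets % y != 0:
            # Assumption of this sketch: only the factor case is modelled.
            raise ValueError("y must be a factor of num_buckets")
        # Groups of y buckets: take the x-th bucket of every group.
        return list(range(x, num_buckets + 1, y)), 1.0

    if y % num_buckets != 0:
        # Assumption of this sketch: only the multiple case is modelled.
        raise ValueError("y must be a multiple of num_buckets")
    # y > num_buckets: every bucket is split into y / num_buckets parts,
    # and only one part of a single bucket is read.
    bucket = (x - 1) % num_buckets + 1
    return [bucket], num_buckets / y


if __name__ == "__main__":
    for x, y in [(6, 8), (23, 32), (3, 64), (1, 16)]:
        buckets, fraction = selected_buckets(x, y)
        print(f"BUCKET {x} OUT OF {y}: buckets {buckets}, "
              f"fraction of each bucket read: {fraction}")
```

The last pair, (1, 16), is the case from the book excerpt in the question, where buckets 1 and 17 are processed together.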

answered Sep 19 '22 21:09

bbkglb
