Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?

I am trying to create a table in Hive

CREATE TABLE BUCKET_TABLE AS 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll
CLUSTERED BY (key) INTO 1000 BUCKETS;

This syntax is failing - but I am not sure if it is even possible to do this combined statement. Any ideas?

like image 409
Andrew Avatar asked Jul 22 '14 20:07

Andrew


People also ask

Can we have bucketing for external tables in Hive?

Yes, Hive does support bucketing and partitioning for external tables.

Can we do bucketing without partitioning in Hive?

Bucketing can also be done even without partitioning on Hive tables. Bucketed tables allow much more efficient sampling than the non-bucketed tables.

Which algorithm is used in Hive to assign buckets in a Hive table which is Bucketed on a certain column?

To read and store data in buckets, a hashing algorithm is used to calculate the bucketed column value (simplest hashing function is modulus). For example, if we decide to have a total number of buckets to be 10, data will be stored in column value % 10, ranging from 0-9 (0 to n-1) buckets.

How to create a bucketing table in hive?

To create a Hive table with bucketing, use CLUSTERED BY clause with the column name you wanted to bucket and the count of the buckets. Loading/inserting data into the Bucketing table would be the same as inserting data into the table. If you are using Hive < 2.x version, you need to set the hive.enforce.bucketing property to true.

How does cluster by work in hive?

In Hive, CLUSTER BY will help re-partition both by the join expressions and sort them inside the partitions. Let us consider an example better to understand the working of “CLUSTER BY” clause. Let us create a Hive table and then load some data in it using CREATE and LOAD commands.

What is the difference between bucketing and partitioning in hive?

Generally, in the table directory, each bucket is just a file, and Bucket numbering is 1-based. v. Along with Partitioning on Hive tables bucketing can be done and even without partitioning. vi. Moreover, Bucketed tables will create almost equally distributed data file parts. i.

How do I reduce the number of tasks in hive bucket?

hive.enforce.bucketing =true several reduce tasks is set equal to the number of buckets that are mentioned in the table. Set hive.optimize.bucketmapjoin = True This enables the bucket to join operation, leading to reduced scan cycles while executing queries on bucketed tables.


1 Answers

Came across this question and saw there was no answer provided. I looked further and found the answer in the Hive documentation.

This will never work, because of the following restrictions on CTAS:

  1. The target table cannot be a partitioned table.
  2. The target table cannot be an external table.
  3. The target table cannot be a list bucketing table.

Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect%28CTAS

Furthermore https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
...
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
...
[AS select_statement];

Clustering requires the column to be defined and then the cfg goes to the As select_statement Therefore at this time it is not possible.

Optionally, you can ALTER the table and add buckets, but this does not change existing data.

CREATE TABLE BUCKET_TABLE 
STORED AS ORC AS 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll limit 0;
ALTER TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS;
ALTER TABLE BUCKET_TABLE SET TBLPROPERTIES ('transactional'='true');
INSERT INTO BUCKET_TABLE 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;
like image 199
Nebulastic Avatar answered Sep 26 '22 05:09

Nebulastic