Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?

Q: How does cluster by work in hive?

In Hive, CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same columns: rows are repartitioned across reducers by the given expressions and then sorted within each reducer's partition.
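
As a minimal sketch of this, assuming a hypothetical table `employees` with columns `dept_id` and `name` (neither appears in the original text), the two queries below are expected to distribute and sort rows in the same way:

```sql
-- Hypothetical table and columns, for illustration only.
-- Short form: distribute by dept_id and sort by dept_id within each reducer.
SELECT dept_id, name
FROM employees
CLUSTER BY dept_id;

-- Long form of the same request: rows with the same dept_id go to the same
-- reducer, and each reducer's output is sorted by dept_id.
SELECT dept_id, name
FROM employees
DISTRIBUTE BY dept_id
SORT BY dept_id;
```

Note that the sort is per reducer, so the overall result is not globally ordered the way it would be with ORDER BY.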

Q: What is the difference between bucketing and partitioning in hive?

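Partitioning creates one subdirectory per distinct partition value, while bucketing splits the data into a fixed number of files. In the table (or partition) directory, each bucket is just a file, and bucket numbering (for example in TABLESAMPLE) is 1-based. Bucketing can be combined with partitioning on a Hive table or used without it. Bucketed tables also produce data files of roughly equal size, provided the bucketing column is well distributed.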
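
To make the distinction concrete, here is a small sketch of a table that uses both features. The table name, columns, and bucket count are hypothetical and chosen only for illustration:

```sql
-- Hypothetical schema, for illustration only.
-- PARTITIONED BY: one subdirectory per distinct view_date value.
-- CLUSTERED BY:   rows in each partition are hashed on user_id into 32 files.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```

Dropping the PARTITIONED BY line would give a table that is bucketed but not partitioned.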

Q: How do I reduce the number of tasks in hive bucket?

When hive.enforce.bucketing = true, the number of reduce tasks is set equal to the number of buckets declared for the table. Setting hive.optimize.bucketmapjoin = true enables bucket map joins, which reduce the amount of data scanned when joining bucketed tables.
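
A rough sketch of how these settings might be used in a session follows. The tables `orders_bucketed` and `customers_bucketed` are hypothetical and are assumed to be bucketed on `customer_id`, with bucket counts that are multiples of each other. As far as I know, hive.enforce.bucketing is only needed on Hive versions before 2.0, where bucketing became always enforced.

```sql
-- Assumed to be needed only on Hive 0.x and 1.x.
SET hive.enforce.bucketing = true;
-- Allow the planner to use a bucket map join where the tables qualify.
SET hive.optimize.bucketmapjoin = true;

-- Hypothetical tables, both assumed to be bucketed on customer_id.
SELECT o.order_id, c.name
FROM orders_bucketed o
JOIN customers_bucketed c ON (o.customer_id = c.customer_id);
```

Whether the planner actually picks a bucket map join depends on the Hive version and on other join settings, so it is worth checking the plan with EXPLAIN.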

Tags: hadoop, hive, bucket, hiveql, hadoop-partitioning

I am trying to create a table in Hive:

CREATE TABLE BUCKET_TABLE AS
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NULL
CLUSTERED BY (key) INTO 1000 BUCKETS;

This syntax fails, and I am not sure whether it is even possible to do this in one combined statement. Any ideas?

asked Jul 22 '14 20:07

Andrew

1 Answer

Came across this question and saw there was no answer provided. I looked further and found the answer in the Hive documentation.

This will never work, because of the following restrictions on CTAS:

The target table cannot be a partitioned table.
The target table cannot be an external table.
The target table cannot be a list bucketing table.

Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect%28CTAS

Furthermore, the CREATE TABLE grammar at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL is:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
...
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
...
[AS select_statement];

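In this grammar the CLUSTERED BY clause must appear before AS select_statement, not after it as in the question, and it refers to columns that have to be defined already, whereas a CTAS takes its columns from the select_statement. Therefore, at this time, it is not possible.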
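
One way around this, sketched here under assumptions, is to declare the bucketed table explicitly and fill it with a separate INSERT. The column list (`key STRING, value STRING`) is hypothetical, since the question does not show the schema of TABLE1, and STORED AS ORC is just a choice made for the sketch.

```sql
-- Hypothetical column list; replace it with the real columns of TABLE1.
CREATE TABLE BUCKET_TABLE (
  key   STRING,
  value STRING
)
CLUSTERED BY (key) INTO 1000 BUCKETS
STORED AS ORC;

-- Populate the bucketed table with the rows of TABLE1 that have no match in TABLE2.
INSERT INTO TABLE BUCKET_TABLE
SELECT a.key, a.value
FROM TABLE1 a
LEFT JOIN TABLE2 b ON (a.key = b.key)
WHERE b.key IS NULL;
```

This takes two statements instead of one, but the columns exist by the time CLUSTERED BY refers to them.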

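Alternatively, you can create the table empty, ALTER it to add buckets, and then insert the data. The ALTER changes only the table metadata and does not reorganize existing data, which is why the table is created with LIMIT 0 first: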

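-- Create an empty table with the right columns (LIMIT 0 copies the schema only).
CREATE TABLE BUCKET_TABLE
STORED AS ORC AS
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NULL LIMIT 0;
-- Declare the buckets; this only updates the table metadata.
ALTER TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS;
ALTER TABLE BUCKET_TABLE SET TBLPROPERTIES ('transactional'='true');
-- Load the data so that it is written into the declared buckets.
INSERT INTO BUCKET_TABLE
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NULL;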
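
To check that the bucketing metadata was applied, something like the following should work. This is only a sketch; the exact layout of the output varies between Hive versions.

```sql
-- Look for the "Num Buckets" and "Bucket Columns" rows in the output.
DESCRIBE FORMATTED BUCKET_TABLE;
```

If the ALTER took effect, Num Buckets should read 1000 and Bucket Columns should list key.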

answered Sep 26 '22 05:09

Nebulastic
