I am trying to create a table in Hive
CREATE TABLE BUCKET_TABLE AS
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll
CLUSTERED BY (key) INTO 1000 BUCKETS;
This syntax is failing - but I am not sure if it is even possible to do this combined statement. Any ideas?
Yes, Hive does support bucketing and partitioning for external tables.
Bucketing can also be done even without partitioning on Hive tables. Bucketed tables allow much more efficient sampling than the non-bucketed tables.
To read and store data in buckets, a hashing algorithm is used to calculate the bucketed column value (simplest hashing function is modulus). For example, if we decide to have a total number of buckets to be 10, data will be stored in column value % 10, ranging from 0-9 (0 to n-1) buckets.
To create a Hive table with bucketing, use CLUSTERED BY clause with the column name you wanted to bucket and the count of the buckets. Loading/inserting data into the Bucketing table would be the same as inserting data into the table. If you are using Hive < 2.x version, you need to set the hive.enforce.bucketing property to true.
In Hive, CLUSTER BY will help re-partition both by the join expressions and sort them inside the partitions. Let us consider an example better to understand the working of “CLUSTER BY” clause. Let us create a Hive table and then load some data in it using CREATE and LOAD commands.
Generally, in the table directory, each bucket is just a file, and Bucket numbering is 1-based. v. Along with Partitioning on Hive tables bucketing can be done and even without partitioning. vi. Moreover, Bucketed tables will create almost equally distributed data file parts. i.
hive.enforce.bucketing =true several reduce tasks is set equal to the number of buckets that are mentioned in the table. Set hive.optimize.bucketmapjoin = True This enables the bucket to join operation, leading to reduced scan cycles while executing queries on bucketed tables.
Came across this question and saw there was no answer provided. I looked further and found the answer in the Hive documentation.
This will never work, because of the following restrictions on CTAS:
Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect%28CTAS
Furthermore https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
...
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
...
[AS select_statement];
Clustering requires the column to be defined and then the cfg goes to the As select_statement Therefore at this time it is not possible.
Optionally, you can ALTER the table and add buckets, but this does not change existing data.
CREATE TABLE BUCKET_TABLE
STORED AS ORC AS
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll limit 0;
ALTER TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS;
ALTER TABLE BUCKET_TABLE SET TBLPROPERTIES ('transactional'='true');
INSERT INTO BUCKET_TABLE
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With