
Structure Difference between partitioning and bucketing in hive

Tags: hadoop, hive, hdfs, cloudera, hortonworks-data-platform

I created two tables:

1) One for Partitioning with bucketing
2) Only bucketing table

I know the concepts of partitioning and bucketing in Hive, but I am a little confused because I have read that 'partitioning creates directories and bucketing creates files'. I agree with the first part, because I can see the directories in the HDFS Hive warehouse. For the bucketing-only table, however, I cannot see any files in HDFS except the data file that I loaded into the table. So where are the files for the bucketing-only table? Under the partitioned directory I can see files named like 00000_0, but those belong to the partitioned table. What about the bucketing-only table?
Below is my code for creating the table:

CREATE TABLE Employee(
ID BIGINT,
NAME STRING, 
SALARY BIGINT,
COUNTRY STRING 
)
CLUSTERED BY(ID) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

The code for loading the data is as follows:

load data local inpath '/home/cloudera/Desktop/SampleData.txt' into table employee;

I have read that buckets are created when we create the table. Please correct me if I am missing something or going wrong. Can anyone help, please?

527

asked Oct 09 '17 10:10

whatsinthename

2 Answers

I created Hive tables with an explicit LOCATION (which is usually my choice; note that without the EXTERNAL keyword they are still managed tables). You can stick with your own setup.

Please follow these steps:

Create a database

CREATE DATABASE IF NOT EXISTS testdb LOCATION '/hivedb/testdb';

Create a clustered table (bucketed table)

CREATE TABLE testdb.Employee(
ID BIGINT,
NAME STRING, 
SALARY BIGINT,
COUNTRY STRING 
)
CLUSTERED BY(ID) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/hivedb/testdb/employee';

Create a plain table

CREATE TABLE testdb.Employee_plain_table(
ID BIGINT,
NAME STRING, 
SALARY BIGINT,
COUNTRY STRING 
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/hivedb/testdb/employee_plain_table';

Enforce bucketing, as recommended by @lake in the previous answer
```
set hive.enforce.bucketing = true;
```

Create a data file ('data.txt'). I created a data file with 20 records.

1,AAAAA,1000.00,USA
2,BBBBB,2000.00,CANADA
3,CCCCC,3000.00,MEXICO
4,DDDDD,4000.00,BRAZIL
5,EEEEE,5000.00,ARGENTINA
6,DDDDD,6000.00,CHILE
7,FFFFF,7000.00,BOLIVIA
8,GGGGG,8000.00,VENEZUELA
9,HHHHH,9000.00,PERU
10,IIIII,10000.00,COLOMBIA
11,JJJJJ,11000.00,EQUADOR
12,KKKKK,12000.00,URUGUAY
13,LLLLL,13000.00,PARAGUAY
14,MMMMM,14000.00,GUYANA
15,NNNNN,15000.00,NICARAGUA
16,OOOOO,16000.00,PANAMA
17,PPPPP,17000.00,COSTA RICA
18,QQQQQ,18000.00,HAITI
19,RRRRR,19000.00,DOMINICA
20,SSSSS,20000.00,JAMAICA

Copy the data file to HDFS location '/hivedb/testdb/employee_plain_table'
```
./hadoop fs -put ~/so/data.txt /hivedb/testdb/employee_plain_table
```
Run a select * command over testdb.Employee_plain_table
```
select * from testdb.Employee_plain_table;
```
This should display 20 records.

Use an insert command
```
insert overwrite table testdb.employee select * from testdb.employee_plain_table;
```
This should run a MapReduce job and insert the records into the bucketed table.

This will create 5 files, because the DDL of the employee table specifies 5 buckets.

Verify this with the following command:

./hadoop fs -ls /hivedb/testdb/employee
Found 5 items
-rwxr-xr-x   1 hduser supergroup 95 2017-10-19 11:04 /hivedb/testdb/employee/000000_0
-rwxr-xr-x   1 hduser supergroup 81 2017-10-19 11:04 /hivedb/testdb/employee/000001_0
-rwxr-xr-x   1 hduser supergroup 90 2017-10-19 11:05 /hivedb/testdb/employee/000002_0
-rwxr-xr-x   1 hduser supergroup 88 2017-10-19 11:05 /hivedb/testdb/employee/000003_0
-rwxr-xr-x   1 hduser supergroup 84 2017-10-19 11:05 /hivedb/testdb/employee/000004_0

Open each file and compare it with the original data file to see how the rows were distributed across the buckets.
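To make that comparison easier, here is a minimal shell sketch of how the rows would be assigned to bucket files. It assumes Hive's legacy bucketing (the behaviour before Hive 3 introduced a second bucketing version), where a small positive BIGINT hashes to its own value, so the bucket number is simply `ID % 5`. The helper name `bucket_file` is made up for illustration; it is not a Hive or Hadoop command.

```shell
#!/bin/sh
# Sketch only: mimics legacy Hive bucketing for a small positive BIGINT key,
# where hash(ID) == ID and bucket = hash mod number_of_buckets.
# bucket_file is a hypothetical helper, not part of Hive or Hadoop.
bucket_file() {
    id=$1
    buckets=$2
    printf '%06d_0\n' $(( id % buckets ))
}

# Show the bucket file for each of the 20 IDs in data.txt (5 buckets).
n=1
while [ "$n" -le 20 ]; do
    echo "ID $n -> $(bucket_file "$n" 5)"
    n=$(( n + 1 ))
done
```

Under that assumption, IDs 5, 10, 15 and 20 go to 000000_0, IDs 1, 6, 11 and 16 go to 000001_0, and so on. The file sizes in the listing above are consistent with this split, and you can confirm it on your cluster with `hadoop fs -cat` on each file.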

Hope this clarifies your query!

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

Update: You used LOAD DATA with LOCAL, which is only a copy operation: it copies the given input file from the source to the destination location. A load from a local path is a copy, whereas a load from an HDFS path is a move. No MapReduce job is involved, so no bucketing happened.
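The difference between the two paths can be imitated locally. The sketch below involves no Hive or HDFS at all: a plain `cp` stands in for the load, and an `awk` split stands in for the insert. The directory names and the six-row input file are hypothetical, and the `ID % 5` routing assumes legacy bucketing for small positive BIGINT keys.

```shell
#!/bin/sh
# Sketch only: a local imitation of "load" versus "insert"; no Hive or HDFS involved.
# Assumes legacy bucketing (bucket = ID % 5 for small positive BIGINT IDs).
work=$(mktemp -d)
mkdir "$work/loaded" "$work/inserted"

# Hypothetical stand-in for the local input file.
printf '%s\n' '1,AAAAA,1000.00,USA' '2,BBBBB,2000.00,CANADA' \
    '3,CCCCC,3000.00,MEXICO' '4,DDDDD,4000.00,BRAZIL' \
    '5,EEEEE,5000.00,ARGENTINA' '6,DDDDD,6000.00,CHILE' \
    > "$work/SampleData.txt"

# A load from a local path acts like a plain copy: one file, no bucketing.
cp "$work/SampleData.txt" "$work/loaded/"

# An insert rewrites the rows, so each row can be routed to its bucket file.
awk -F, -v dir="$work/inserted" '
    { f = sprintf("%s/%06d_0", dir, $1 % 5); print > f }' "$work/SampleData.txt"

# Arithmetic expansion strips any padding that wc may add.
loaded_count=$(( $(ls "$work/loaded" | wc -l) ))
inserted_count=$(( $(ls "$work/inserted" | wc -l) ))
echo "files after load:   $loaded_count"
echo "files after insert: $inserted_count"

rm -rf "$work"
```

The copy leaves a single file under its original name, while the rewrite produces one file per bucket. That matches what you saw: only the loaded data file in the table directory.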

119

answered Oct 13 '22 14:10

Marco99

 set hive.enforce.bucketing = true;  -- (Note: Not needed in Hive 2.x onward)

That will force Hive to create the buckets. You should be able to see a number of files equal to your number of buckets (if you have enough records and a proper distribution of your clustering column).
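The caveat about distribution can be illustrated with a small shell sketch. It assumes legacy bucketing, where a small positive BIGINT lands in bucket `ID % N`. `rows_per_bucket` is a hypothetical helper that only counts rows; it does not touch Hive.

```shell
#!/bin/sh
# Sketch only: counts how many IDs fall into each bucket, assuming the
# legacy rule bucket = ID % buckets for small positive BIGINT keys.
# rows_per_bucket is a hypothetical helper, not a Hive command.
rows_per_bucket() {
    buckets=$1
    shift
    # One ID per line in, one "bucket N: count rows" line out per bucket.
    printf '%s\n' "$@" | awk -v n="$buckets" '
        { count[$1 % n]++ }
        END { for (b = 0; b < n; b++) printf "bucket %d: %d rows\n", b, count[b] }'
}

echo "Evenly spread IDs 1..10:"
rows_per_bucket 5 1 2 3 4 5 6 7 8 9 10

echo "Skewed IDs (all multiples of 5):"
rows_per_bucket 5 5 10 15 20 25
```

With evenly spread IDs every bucket receives rows, whereas with the skewed IDs a single bucket holds all of them.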

Update: The LOAD command doesn't create any buckets; it just puts the data into HDFS. You should load the data into another table and then insert it from that table into the bucketed one using an INSERT OVERWRITE statement.

answered Oct 13 '22 15:10

hlagos
