
Difference between partition and index in Hive

I am new to Hadoop and Hive, and I would like to know the difference between an index and a partition in Hive. When should I use an index, and when should I use a partition?

Thank you!

sonia asked Feb 09 '15 14:02



1 Answer

Indexes are new and still evolving (features are being added). Currently, an index is limited to a single table and cannot be used with external tables. Creating an index creates a separate table that holds the index data. Indexes can be partitioned, matching the partitions of the base table. Their purpose is to speed up the search for data within a table.
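As a rough illustration of the points above, here is a minimal HiveQL sketch. It assumes a hypothetical managed table named `sales` with a `customer_id` column; the index name is also made up. Note that index support was removed in Hive 3.0, so this applies only to older releases.

```sql
-- Hypothetical example: `sales` and `idx_sales_customer` are assumed names.
-- Creating the index also creates a separate index table behind the scenes.
CREATE INDEX idx_sales_customer
ON TABLE sales (customer_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- With DEFERRED REBUILD the index starts out empty; populate it explicitly.
ALTER INDEX idx_sales_customer ON sales REBUILD;

-- List the indexes defined on the table.
SHOW FORMATTED INDEX ON sales;
```

The index is not kept up to date automatically in this setup, so it would have to be rebuilt after the base table's data changes.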

Partitions segregate the data at the HDFS level by creating a sub-directory for each partition. Partitioning limits the number of files read and the amount of data scanned by a query. For this to happen, however, the partition columns must be specified in your WHERE clause.
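A small HiveQL sketch of what this looks like in practice. The table and column names (`sales`, `sale_date`, and so on) are assumptions chosen for illustration, not taken from the question.

```sql
-- Hypothetical table, partitioned by sale date.
-- The partition column is declared in PARTITIONED BY, not in the column list.
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (sale_date STRING);

-- Each partition gets its own sub-directory under the table's location,
-- for example .../sales/sale_date=2015-02-09/
ALTER TABLE sales ADD PARTITION (sale_date = '2015-02-09');

-- The partition column appears in the WHERE clause, so Hive only needs
-- to read the files in the matching sub-directory.
SELECT customer_id, SUM(amount) AS total
FROM sales
WHERE sale_date = '2015-02-09'
GROUP BY customer_id;

-- No filter on sale_date: every partition has to be scanned.
SELECT SUM(amount) AS total
FROM sales
WHERE customer_id = 42;
```

A column that most queries filter on, and that has a moderate number of distinct values (a date, for instance), is usually a reasonable partition key; a column with very many distinct values tends to produce a large number of small partitions.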

While building your data model, you can decide how best to use indexes, partitions, or both, based on the size of your data and your expected query patterns.

mstang answered Sep 27 '22 18:09