 

Skewed tables in Hive

I am learning Hive and came across skewed tables. Please help me understand them.

What are skewed tables in Hive?

How do we create skewed tables?

How do they affect performance?

thiru_k asked Sep 12 '13 14:09


People also ask

How does Hive handle skewed data?

A skew join is used when the join column of a table contains skewed data. A skewed table is one in which a few values occur far more often than the rest. The skewed values are stored in separate files, while the rest of the data goes into another file.

How do you prevent skewness in Hive?

Through Hive configuration: the property hive.optimize.skewjoin controls whether skew join optimization is enabled. The algorithm is as follows: at runtime, detect the keys with a large skew. Instead of processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce job, process those skewed keys.
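A minimal sketch of how this optimization might be switched on for a session. The two properties are standard Hive settings; the tables orders and customers and their columns are hypothetical and only illustrate a join on a skewed key.

```sql
-- Enable runtime skew-join handling for this session.
SET hive.optimize.skewjoin=true;
-- Number of rows per join key above which the key is treated as skewed
-- (100000 is the documented default; tune it for your data).
SET hive.skewjoin.key=100000;

-- Hypothetical tables: orders is assumed to be heavily skewed on customer_id.
SELECT o.order_id, c.name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```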

How do I know if my data is skewed in Hive?

If it is a join, select the top 100 join key values by row count from every table involved in the join. For an analytic function, do the same for the PARTITION BY key. A few keys with disproportionately large counts indicate skew.
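The check described above could look roughly like this. The table name orders and the column customer_id are placeholders, not names from the original question; run the same query against each table in the join.

```sql
-- Count rows per join key and list the 100 most frequent keys.
-- A handful of keys with counts far above the rest suggests skew.
SELECT customer_id,
       COUNT(*) AS cnt
FROM orders
GROUP BY customer_id
ORDER BY cnt DESC
LIMIT 100;
```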


2 Answers

What are skewed tables in Hive?

A skewed table is a special type of table in which the values that appear very often (heavy skew) are split out into separate files, and the rest of the values go into another file.

How do we create skewed tables?

create table <T> (schema) skewed by (keys) on ('value1', 'value2') [stored as directories];

Example :

create table T (c1 string, c2 string) skewed by (c1) on ('x1')

How does it affect performance?

When you specify the skewed values, Hive splits them out into separate files automatically and takes this into account during queries, so that it can skip (or include) whole files where possible, which improves performance.

EDIT :

x1 is the value on which column c1 is skewed. You can list several skewed values for a column, and you can skew on more than one column. For example, with several values for c1:

create table T (c1 string, c2 string) skewed by (c1) on ('x1', 'x2', 'x3')

The advantage of this setup is that the values that appear more frequently than others get split out into separate files (or separate directories if the STORED AS DIRECTORIES clause is used). The execution engine uses this information during query execution to make processing more efficient.
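As a rough illustration of the variants mentioned above, the statements below use made-up table names (T_dirs, T_multi) and made-up skewed values. The clauses themselves follow the Hive DDL syntax for skewed tables.

```sql
-- Single skewed column, with each skewed value placed in its own
-- subdirectory (list bucketing) instead of just a separate file.
CREATE TABLE T_dirs (c1 STRING, c2 STRING)
SKEWED BY (c1) ON ('x1', 'x2', 'x3')
STORED AS DIRECTORIES;

-- Skew on two columns: each entry in ON is a (c1, c2) value pair.
CREATE TABLE T_multi (c1 STRING, c2 INT, c3 STRING)
SKEWED BY (c1, c2) ON (('x1', 1), ('x2', 2))
STORED AS DIRECTORIES;

-- The skew definition of an existing table can be changed or removed.
ALTER TABLE T_dirs SKEWED BY (c1) ON ('x1', 'x4') STORED AS DIRECTORIES;
ALTER TABLE T_dirs NOT SKEWED;
```

Changing the skew definition with ALTER TABLE only affects data written afterwards, so existing data may need to be reloaded to benefit from it.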

Tariq answered Oct 26 '22 16:10


In a skewed table, the rows for a column value that has many records are stored separately (in their own file, or in their own directory with STORED AS DIRECTORIES), and the rest of the data is stored together. Compared with partitioning on every value of that column, this reduces the number of partitions, mappers and intermediate files. For example: out of 100 patients, 90 have high BP and the other 10 have fever, cold, cancer etc. One split is created for the 90 high BP patients and one for the other 10 patients. I hope this answers your question.
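The patient example might be written like this. The table patients, its columns and the literal 'high BP' are hypothetical names chosen only to match the description above.

```sql
-- 'high BP' is the heavily repeated value, so it is declared as skewed;
-- all other diagnoses are stored together.
CREATE TABLE patients (patient_id INT, name STRING, diagnosis STRING)
SKEWED BY (diagnosis) ON ('high BP')
STORED AS DIRECTORIES;

-- A filter on the skewed value gives Hive the chance to read only
-- the data stored for 'high BP' and skip the rest.
SELECT patient_id, name
FROM patients
WHERE diagnosis = 'high BP';
```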

Hegde answered Oct 26 '22 15:10