
Hive Map-Join configuration mystery

Could someone clearly explain what is the difference between

hive.auto.convert.join

and

hive.auto.convert.join.noconditionaltask

configuration parameters?

Also these corresponding size parameters:

hive.mapjoin.smalltable.filesize

and

hive.auto.convert.join.noconditionaltask.size

My observation is that, when running on Tez, the map join works when hive.auto.convert.join.noconditionaltask.size is set to a high enough value, even when hive.mapjoin.smalltable.filesize is set lower than the size of the small table.

Why do we need both

hive.auto.convert.join and hive.auto.convert.join.noconditionaltask?

The Apache documentation is very confusing.

asked Feb 16 '19 18:02 by leftjoin



1 Answer

These parameters control when Hive uses a map join instead of a common join, which in turn affects query performance.

A map join is used when one of the joined tables is small enough to fit in memory, which makes it very fast. Here is an explanation of each parameter:

hive.auto.convert.join

When this parameter is set to true, Hive automatically compares the file size of the smaller table with the value of hive.mapjoin.smalltable.filesize. If the table is larger than this value, the query runs as a common join; otherwise it is converted to a map join. Once auto convert join is enabled, there is no need to provide map-join hints in the query.
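As a minimal sketch of the setting described above, the session below enables automatic conversion and sets the small-table threshold explicitly. The table and column names (orders, countries, order_id, country_code, code, name) are hypothetical and only illustrate the shape of a large-to-small join; 25000000 bytes is the documented default for the threshold.

```sql
-- Let Hive decide whether the join can run as a map join (no hint needed).
SET hive.auto.convert.join=true;

-- Threshold, in bytes, below which a table counts as "small".
-- 25000000 is the documented default; tune it for your cluster.
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical tables: orders is large, countries is small.
-- EXPLAIN prints the plan, so you can check which join type was chosen.
EXPLAIN
SELECT o.order_id, c.name
FROM orders o
JOIN countries c
  ON o.country_code = c.code;
```

If the conversion happens, the plan should contain a map join operator rather than a reduce-side join.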

hive.auto.convert.join.noconditionaltask

When three or more tables are involved in a join, and

hive.auto.convert.join = true - Hive generates three or more map-side joins, on the assumption that all of the tables are small.

hive.auto.convert.join.noconditionaltask = true - Hive combines the three or more map-side joins into a single map-side join if the combined size of n-1 of the tables is below the limit set by hive.auto.convert.join.noconditionaltask.size (10 MB by default).

hive.mapjoin.smalltable.filesize

This setting tells the optimizer what counts as a small table in your system. When a query runs, Hive uses this value to determine whether a join is eligible for conversion to a map join.

hive.auto.convert.join.noconditionaltask.size

This size setting lets the user control what size of table can fit in memory. The value represents the sum of the sizes of the tables that can be converted to in-memory hashmaps.
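To illustrate how the two noconditionaltask settings go together, here is a hedged sketch of a three-table join. The table and column names (sales, stores, products and their columns) are hypothetical; 10000000 bytes is the documented default for the size limit, and you would raise it only if the combined small tables really fit in memory on your cluster.

```sql
-- Automatic map-join conversion must be on for the settings below to matter.
SET hive.auto.convert.join=true;

-- Convert directly to a map join, without generating a conditional task.
SET hive.auto.convert.join.noconditionaltask=true;

-- Limit, in bytes, on the combined size of the n-1 small tables.
-- 10000000 is the documented default.
SET hive.auto.convert.join.noconditionaltask.size=10000000;

-- Hypothetical tables: sales is large; stores and products are small.
EXPLAIN
SELECT s.sale_id, st.city, p.title
FROM sales s
JOIN stores st  ON s.store_id = st.store_id
JOIN products p ON s.product_id = p.product_id;
```

If stores and products together are under the limit, the plan would be expected to show both joins handled map-side in a single stage.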

Here is a good explanation that covers all four parameters with an example:

http://www.openkb.info/2016/01/difference-between-hivemapjoinsmalltabl.html

answered Sep 29 '22 20:09 by Jainik