I am querying a large table that is partitioned on a field called day
.
If I run a query:
select *
from my_table
where day in ('2016-04-01', '2016-03-01')
I get many mappers and reducers and the query takes a long time to run.
If, however, I write a query:
select *
from my_table
where day = '2016-04-01'
or day = '2016-03-01'
I get far less mappers and reducers and the query runs quickly. To me this suggests that in
does not take advantage of partitions in a table. Can anyone confirm this and explain why?
Hive Version: 1.2.1
Hadoop Version: 2.3.4.7-4
Details:
I believe the relevant part of the execution plans are...
Using Where or
No filter operator at all
Using Where in
Filter Operator
predicate: (day) IN ('2016-04-01', '2016-03-01') (type: boolean)
Statistics: Num rows: 100000000 Data size: 9999999999
The hive docs just say:
'What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns.'
But don't elaborate. I couldn't find any SO posts directly relating to this.
Thanks!
I am using Hive 1.1.0 with Cloudera 5.13.3 and IN
follows the same optimization as the equal operator (=
) according to the explain plans I ran in Hue.
My table is partitioned on LOAD_YEAR (SMALLINT)
and LOAD_MONTH (TINYINT)
and has these two partitions:
load_year=2018/load_month=10
(19,828,71 rows)load_year=2018/load_month=11
(702,856 rows)Below are various queries and their explain plans.
1. Equal (=
) operator
Query:
SELECT ID
FROM TBL
WHERE LOAD_MONTH = 11Y
Explain Plan:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (load_month = 11) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
2. IN
operator
Query (note that there is no month 12 in the data):
SELECT ID
FROM TBL
WHERE LOAD_MONTH IN (11Y, 12Y)
Explain Plan:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (load_month = 11) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
3. Equal (=
) in conjunction with AND and OR
Query:
SELECT ID
FROM TBL
WHERE
(LOAD_YEAR = 2018S AND LOAD_MONTH = 11Y)
OR (LOAD_YEAR = 2019S AND LOAD_MONTH = 1Y)
Explain Plan:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (((load_year = 2018) and (load_month = 11)) or ((load_year = 2019) and (load_month = 1))) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
4. Arithmetic operation
Query:
SELECT ID
FROM TBL
WHERE (LOAD_YEAR * 100 + LOAD_MONTH) IN (201811, 201901)
Side note:
100 doesn't have a suffix, so it's an INT
, and (LOAD_YEAR * 100 + LOAD_MONTH)
is also an INT
. This ensures that the result is accurate. Since LOAD_YEAR
is a SMALLINT
and LOAD_MONTH
a TINYINT
, arithmetic calculations on the two use SMALLINT
for the results and the max value stored is 32,767 (not enough for yyyymm
, which needs 6 digits, i.e., at least up to 999,999). With 100 as an INT
, calculations are made with the INT
type and allow numbers up to 2,147,483,647.
Explain Plan:
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: tbl
filterExpr: (201811) IN (201811, 201901) (type: boolean)
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
ListSink
Summary
All these queries only scan the second partition, thereby avoiding the ~20 million rows in the other partition.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With