Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive: where + in does not use partition?

I am querying a large table that is partitioned on a field called day. If I run a query:

select * from my_table where day in ('2016-04-01', '2016-03-01')

I get many mappers and reducers and the query takes a long time to run.

If, however, I write a query:

select * from my_table where day = '2016-04-01' or day = '2016-03-01'

I get far less mappers and reducers and the query runs quickly. To me this suggests that in does not take advantage of partitions in a table. Can anyone confirm this and explain why?

Hive Version: 1.2.1
Hadoop Version: 2.3.4.7-4

Details:
I believe the relevant part of the execution plans are...
Using Where or
No filter operator at all

Using Where in
Filter Operator predicate: (day) IN ('2016-04-01', '2016-03-01') (type: boolean) Statistics: Num rows: 100000000 Data size: 9999999999

The hive docs just say:
'What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns.'

But don't elaborate. I couldn't find any SO posts directly relating to this.

Thanks!

like image 387
cerpintaxt Avatar asked Apr 27 '16 16:04

cerpintaxt


1 Answers

tl;dr

I am using Hive 1.1.0 with Cloudera 5.13.3 and IN follows the same optimization as the equal operator (=) according to the explain plans I ran in Hue.


Examples

My table is partitioned on LOAD_YEAR (SMALLINT) and LOAD_MONTH (TINYINT) and has these two partitions:

  1. load_year=2018/load_month=10 (19,828,71 rows)
  2. load_year=2018/load_month=11 (702,856 rows)

Below are various queries and their explain plans.

1. Equal (=) operator

Query:

SELECT ID
FROM TBL
WHERE LOAD_MONTH = 11Y

Explain Plan:

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (load_month = 11) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink

2. IN operator

Query (note that there is no month 12 in the data):

SELECT ID
FROM TBL
WHERE LOAD_MONTH IN (11Y, 12Y)

Explain Plan:

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (load_month = 11) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink

3. Equal (=) in conjunction with AND and OR

Query:

SELECT ID
FROM TBL
WHERE
    (LOAD_YEAR = 2018S AND LOAD_MONTH = 11Y)
OR  (LOAD_YEAR = 2019S AND LOAD_MONTH = 1Y)

Explain Plan:

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (((load_year = 2018) and (load_month = 11)) or ((load_year = 2019) and (load_month = 1))) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink

4. Arithmetic operation

Query:

SELECT ID
FROM TBL
WHERE (LOAD_YEAR * 100 + LOAD_MONTH) IN (201811, 201901)

Side note:

100 doesn't have a suffix, so it's an INT, and (LOAD_YEAR * 100 + LOAD_MONTH) is also an INT. This ensures that the result is accurate. Since LOAD_YEAR is a SMALLINT and LOAD_MONTH a TINYINT, arithmetic calculations on the two use SMALLINT for the results and the max value stored is 32,767 (not enough for yyyymm, which needs 6 digits, i.e., at least up to 999,999). With 100 as an INT, calculations are made with the INT type and allow numbers up to 2,147,483,647.

Explain Plan:

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (201811) IN (201811, 201901) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink

Summary

All these queries only scan the second partition, thereby avoiding the ~20 million rows in the other partition.

like image 166
Narvarth Avatar answered Oct 14 '22 02:10

Narvarth